diff --git a/CITATION.cff b/CITATION.cff
new file mode 100644
index 0000000000000000000000000000000000000000..81ea8f792645b1904e792918590eb215c62dd323
--- /dev/null
+++ b/CITATION.cff
@@ -0,0 +1,9 @@
+cff-version: 1.2.0
+message: "If you use this software, please cite it as below."
+title: "OpenMMLab's Pre-training Toolbox and Benchmark"
+authors:
+ - name: "MMPreTrain Contributors"
+version: 0.15.0
+date-released: 2023-04-06
+repository-code: "https://github.com/open-mmlab/mmpretrain"
+license: Apache-2.0
diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md
new file mode 100644
index 0000000000000000000000000000000000000000..ce84c2a09f59785d3220a722b8ba1282c97a8030
--- /dev/null
+++ b/CONTRIBUTING.md
@@ -0,0 +1,73 @@
+# Contributing to MMPreTrain
+
+- [Contributing to MMPreTrain](#contributing-to-mmpretrain)
+ - [Workflow](#workflow)
+ - [Code style](#code-style)
+ - [Python](#python)
+ - [C++ and CUDA](#c-and-cuda)
+ - [Pre-commit Hook](#pre-commit-hook)
+
+Thanks for your interest in contributing to MMPreTrain! All kinds of contributions are welcome, including but not limited to the following.
+
+- Fix typos or bugs
+- Add documentation or translate the documentation into other languages
+- Add new features and components
+
+## Workflow
+
+We recommend that potential contributors follow this workflow.
+
+1. Fork and pull the latest MMPreTrain repository, then follow [get started](https://mmpretrain.readthedocs.io/en/latest/get_started.html) to set up the environment.
+2. Check out a new branch (**do not use the master or dev branch** for PRs)
+
+```bash
+git checkout -b xxxx  # xxxx is the name of the new branch
+```
+
+3. Edit the related files following the code style mentioned below
+4. Use the [pre-commit hook](https://pre-commit.com/) to check and format your changes (see the example commands after this list)
+5. Commit your changes
+6. Create a PR with related information
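+
+For reference, steps 4-6 could look like the following on the command line (assuming the pre-commit hook described below is installed; the file path, commit message, and branch name `xxxx` are only examples):
+
+```bash
+# run the pre-commit hooks on the files you changed
+pre-commit run --files path/to/changed_file.py
+
+# commit the changes and push the branch created in step 2
+git add path/to/changed_file.py
+git commit -m "docs: fix a typo in the README"
+git push origin xxxx
+```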
+
+## Code style
+
+### Python
+
+We adopt [PEP8](https://www.python.org/dev/peps/pep-0008/) as the preferred code style.
+
+We use the following tools for linting and formatting:
+
+- [flake8](https://github.com/PyCQA/flake8): A wrapper around some linter tools.
+- [isort](https://github.com/timothycrosley/isort): A Python utility to sort imports.
+- [yapf](https://github.com/google/yapf): A formatter for Python files.
+- [codespell](https://github.com/codespell-project/codespell): A Python utility to fix common misspellings in text files.
+- [mdformat](https://github.com/executablebooks/mdformat): Mdformat is an opinionated Markdown formatter that can be used to enforce a consistent style in Markdown files.
+- [docformatter](https://github.com/myint/docformatter): A formatter to format docstrings.
+
+Style configurations of yapf and isort can be found in [setup.cfg](https://github.com/open-mmlab/mmpretrain/blob/main/setup.cfg).
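+
+If you want to run these tools manually (the pre-commit hook described below also runs them for you), the usual invocations are as follows; running them on the `mmpretrain` package directory is just an example:
+
+```shell
+flake8 mmpretrain       # lint
+isort mmpretrain        # sort imports in place
+yapf -r -i mmpretrain   # format Python files recursively, in place
+```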
+
+### C++ and CUDA
+
+We follow the [Google C++ Style Guide](https://google.github.io/styleguide/cppguide.html).
+
+## Pre-commit Hook
+
+We use a [pre-commit hook](https://pre-commit.com/) that, on every commit, checks and formats the code with `flake8`, `yapf` and `isort`, checks trailing whitespace and Markdown files,
+fixes `end-of-files`, `double-quoted-strings`, `python-encoding-pragma` and `mixed-line-ending`, and sorts `requirements.txt` automatically.
+The config for the pre-commit hook is stored in [.pre-commit-config](https://github.com/open-mmlab/mmpretrain/blob/main/.pre-commit-config.yaml).
+
+After you clone the repository, you will need to install and initialize the pre-commit hook.
+
+```shell
+pip install -U pre-commit
+```
+
+Then, from the repository folder, run:
+
+```shell
+pre-commit install
+```
+
+After this, the code linters and formatters will be enforced on every commit.
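+
+You can also run all hooks against the whole repository at any time, for example before opening a PR:
+
+```shell
+pre-commit run --all-files
+```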
+
+> Before you create a PR, make sure that your code lints and is formatted by yapf.
diff --git a/LICENSE b/LICENSE
new file mode 100644
index 0000000000000000000000000000000000000000..ae87343779455c4c4b43e10a27d1657142666726
--- /dev/null
+++ b/LICENSE
@@ -0,0 +1,203 @@
+Copyright (c) OpenMMLab. All rights reserved
+
+ Apache License
+ Version 2.0, January 2004
+ http://www.apache.org/licenses/
+
+ TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
+
+ 1. Definitions.
+
+ "License" shall mean the terms and conditions for use, reproduction,
+ and distribution as defined by Sections 1 through 9 of this document.
+
+ "Licensor" shall mean the copyright owner or entity authorized by
+ the copyright owner that is granting the License.
+
+ "Legal Entity" shall mean the union of the acting entity and all
+ other entities that control, are controlled by, or are under common
+ control with that entity. For the purposes of this definition,
+ "control" means (i) the power, direct or indirect, to cause the
+ direction or management of such entity, whether by contract or
+ otherwise, or (ii) ownership of fifty percent (50%) or more of the
+ outstanding shares, or (iii) beneficial ownership of such entity.
+
+ "You" (or "Your") shall mean an individual or Legal Entity
+ exercising permissions granted by this License.
+
+ "Source" form shall mean the preferred form for making modifications,
+ including but not limited to software source code, documentation
+ source, and configuration files.
+
+ "Object" form shall mean any form resulting from mechanical
+ transformation or translation of a Source form, including but
+ not limited to compiled object code, generated documentation,
+ and conversions to other media types.
+
+ "Work" shall mean the work of authorship, whether in Source or
+ Object form, made available under the License, as indicated by a
+ copyright notice that is included in or attached to the work
+ (an example is provided in the Appendix below).
+
+ "Derivative Works" shall mean any work, whether in Source or Object
+ form, that is based on (or derived from) the Work and for which the
+ editorial revisions, annotations, elaborations, or other modifications
+ represent, as a whole, an original work of authorship. For the purposes
+ of this License, Derivative Works shall not include works that remain
+ separable from, or merely link (or bind by name) to the interfaces of,
+ the Work and Derivative Works thereof.
+
+ "Contribution" shall mean any work of authorship, including
+ the original version of the Work and any modifications or additions
+ to that Work or Derivative Works thereof, that is intentionally
+ submitted to Licensor for inclusion in the Work by the copyright owner
+ or by an individual or Legal Entity authorized to submit on behalf of
+ the copyright owner. For the purposes of this definition, "submitted"
+ means any form of electronic, verbal, or written communication sent
+ to the Licensor or its representatives, including but not limited to
+ communication on electronic mailing lists, source code control systems,
+ and issue tracking systems that are managed by, or on behalf of, the
+ Licensor for the purpose of discussing and improving the Work, but
+ excluding communication that is conspicuously marked or otherwise
+ designated in writing by the copyright owner as "Not a Contribution."
+
+ "Contributor" shall mean Licensor and any individual or Legal Entity
+ on behalf of whom a Contribution has been received by Licensor and
+ subsequently incorporated within the Work.
+
+ 2. Grant of Copyright License. Subject to the terms and conditions of
+ this License, each Contributor hereby grants to You a perpetual,
+ worldwide, non-exclusive, no-charge, royalty-free, irrevocable
+ copyright license to reproduce, prepare Derivative Works of,
+ publicly display, publicly perform, sublicense, and distribute the
+ Work and such Derivative Works in Source or Object form.
+
+ 3. Grant of Patent License. Subject to the terms and conditions of
+ this License, each Contributor hereby grants to You a perpetual,
+ worldwide, non-exclusive, no-charge, royalty-free, irrevocable
+ (except as stated in this section) patent license to make, have made,
+ use, offer to sell, sell, import, and otherwise transfer the Work,
+ where such license applies only to those patent claims licensable
+ by such Contributor that are necessarily infringed by their
+ Contribution(s) alone or by combination of their Contribution(s)
+ with the Work to which such Contribution(s) was submitted. If You
+ institute patent litigation against any entity (including a
+ cross-claim or counterclaim in a lawsuit) alleging that the Work
+ or a Contribution incorporated within the Work constitutes direct
+ or contributory patent infringement, then any patent licenses
+ granted to You under this License for that Work shall terminate
+ as of the date such litigation is filed.
+
+ 4. Redistribution. You may reproduce and distribute copies of the
+ Work or Derivative Works thereof in any medium, with or without
+ modifications, and in Source or Object form, provided that You
+ meet the following conditions:
+
+ (a) You must give any other recipients of the Work or
+ Derivative Works a copy of this License; and
+
+ (b) You must cause any modified files to carry prominent notices
+ stating that You changed the files; and
+
+ (c) You must retain, in the Source form of any Derivative Works
+ that You distribute, all copyright, patent, trademark, and
+ attribution notices from the Source form of the Work,
+ excluding those notices that do not pertain to any part of
+ the Derivative Works; and
+
+ (d) If the Work includes a "NOTICE" text file as part of its
+ distribution, then any Derivative Works that You distribute must
+ include a readable copy of the attribution notices contained
+ within such NOTICE file, excluding those notices that do not
+ pertain to any part of the Derivative Works, in at least one
+ of the following places: within a NOTICE text file distributed
+ as part of the Derivative Works; within the Source form or
+ documentation, if provided along with the Derivative Works; or,
+ within a display generated by the Derivative Works, if and
+ wherever such third-party notices normally appear. The contents
+ of the NOTICE file are for informational purposes only and
+ do not modify the License. You may add Your own attribution
+ notices within Derivative Works that You distribute, alongside
+ or as an addendum to the NOTICE text from the Work, provided
+ that such additional attribution notices cannot be construed
+ as modifying the License.
+
+ You may add Your own copyright statement to Your modifications and
+ may provide additional or different license terms and conditions
+ for use, reproduction, or distribution of Your modifications, or
+ for any such Derivative Works as a whole, provided Your use,
+ reproduction, and distribution of the Work otherwise complies with
+ the conditions stated in this License.
+
+ 5. Submission of Contributions. Unless You explicitly state otherwise,
+ any Contribution intentionally submitted for inclusion in the Work
+ by You to the Licensor shall be under the terms and conditions of
+ this License, without any additional terms or conditions.
+ Notwithstanding the above, nothing herein shall supersede or modify
+ the terms of any separate license agreement you may have executed
+ with Licensor regarding such Contributions.
+
+ 6. Trademarks. This License does not grant permission to use the trade
+ names, trademarks, service marks, or product names of the Licensor,
+ except as required for reasonable and customary use in describing the
+ origin of the Work and reproducing the content of the NOTICE file.
+
+ 7. Disclaimer of Warranty. Unless required by applicable law or
+ agreed to in writing, Licensor provides the Work (and each
+ Contributor provides its Contributions) on an "AS IS" BASIS,
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
+ implied, including, without limitation, any warranties or conditions
+ of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
+ PARTICULAR PURPOSE. You are solely responsible for determining the
+ appropriateness of using or redistributing the Work and assume any
+ risks associated with Your exercise of permissions under this License.
+
+ 8. Limitation of Liability. In no event and under no legal theory,
+ whether in tort (including negligence), contract, or otherwise,
+ unless required by applicable law (such as deliberate and grossly
+ negligent acts) or agreed to in writing, shall any Contributor be
+ liable to You for damages, including any direct, indirect, special,
+ incidental, or consequential damages of any character arising as a
+ result of this License or out of the use or inability to use the
+ Work (including but not limited to damages for loss of goodwill,
+ work stoppage, computer failure or malfunction, or any and all
+ other commercial damages or losses), even if such Contributor
+ has been advised of the possibility of such damages.
+
+ 9. Accepting Warranty or Additional Liability. While redistributing
+ the Work or Derivative Works thereof, You may choose to offer,
+ and charge a fee for, acceptance of support, warranty, indemnity,
+ or other liability obligations and/or rights consistent with this
+ License. However, in accepting such obligations, You may act only
+ on Your own behalf and on Your sole responsibility, not on behalf
+ of any other Contributor, and only if You agree to indemnify,
+ defend, and hold each Contributor harmless for any liability
+ incurred by, or claims asserted against, such Contributor by reason
+ of your accepting any such warranty or additional liability.
+
+ END OF TERMS AND CONDITIONS
+
+ APPENDIX: How to apply the Apache License to your work.
+
+ To apply the Apache License to your work, attach the following
+ boilerplate notice, with the fields enclosed by brackets "[]"
+ replaced with your own identifying information. (Don't include
+ the brackets!) The text should be enclosed in the appropriate
+ comment syntax for the file format. We also recommend that a
+ file or class name and description of purpose be included on the
+ same "printed page" as the copyright notice for easier
+ identification within third-party archives.
+
+ Copyright 2020 MMPreTrain Authors.
+
+ Licensed under the Apache License, Version 2.0 (the "License");
+ you may not use this file except in compliance with the License.
+ You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing, software
+ distributed under the License is distributed on an "AS IS" BASIS,
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ See the License for the specific language governing permissions and
+ limitations under the License.
diff --git a/MANIFEST.in b/MANIFEST.in
new file mode 100644
index 0000000000000000000000000000000000000000..ad4d8dafbdeb31327429c94430a8338e5f024acb
--- /dev/null
+++ b/MANIFEST.in
@@ -0,0 +1,5 @@
+include requirements/*.txt
+include mmpretrain/.mim/model-index.yml
+include mmpretrain/.mim/dataset-index.yml
+recursive-include mmpretrain/.mim/configs *.py *.yml
+recursive-include mmpretrain/.mim/tools *.py *.sh
diff --git a/README.md b/README.md
index 4301dc733a12bb83bbaac7e18b645284677db06c..5318df5b958b8f54dcba1896776eebfb04ba9871 100644
--- a/README.md
+++ b/README.md
@@ -1,123 +1,339 @@
-# Mobilenetv2
+
+
+

+
+
+
+
+[PyPI](https://pypi.org/project/mmpretrain)
+[Docs](https://mmpretrain.readthedocs.io/en/latest/)
+[Build Status](https://github.com/open-mmlab/mmpretrain/actions)
+[Coverage](https://codecov.io/gh/open-mmlab/mmpretrain)
+[License](https://github.com/open-mmlab/mmpretrain/blob/main/LICENSE)
+[Issues](https://github.com/open-mmlab/mmpretrain/issues)
+
+[📘 Documentation](https://mmpretrain.readthedocs.io/en/latest/) |
+[🛠️ Installation](https://mmpretrain.readthedocs.io/en/latest/get_started.html#installation) |
+[👀 Model Zoo](https://mmpretrain.readthedocs.io/en/latest/modelzoo_statistics.html) |
+[🆕 Update News](https://mmpretrain.readthedocs.io/en/latest/notes/changelog.html) |
+[🤔 Reporting Issues](https://github.com/open-mmlab/mmpretrain/issues/new/choose)
+
+

+
+English | [简体中文](/README_zh-CN.md)
+
+
+
+
+
+
+
+ 
+

+
+ 
+

+
+ 
+

+
+ 
+

+
+ 
+

+
+ 
+
+
+## Introduction
+
+MMPreTrain is an open source pre-training toolbox based on PyTorch. It is a part of the [OpenMMLab](https://openmmlab.com/) project.
+
+The `main` branch works with **PyTorch 1.8+**.
+
+### Major features
+
+- Various backbones and pretrained models
+- Rich training strategies (supervised learning, self-supervised learning, multi-modality learning etc.)
+- Bag of training tricks
+- Large-scale training configs
+- High efficiency and extensibility
+- Powerful toolkits for model analysis and experiments
+- Various out-of-the-box inference tasks:
+ - Image Classification
+ - Image Caption
+ - Visual Question Answering
+ - Visual Grounding
+ - Retrieval (Image-To-Image, Text-To-Image, Image-To-Text)
+
+https://github.com/open-mmlab/mmpretrain/assets/26739999/e4dcd3a2-f895-4d1b-a351-fbc74a04e904
+
+## What's new
+
+🌟 v1.2.0 was released in 04/01/2024
+
+- Support LLaVA 1.5.
+- Implement RAM with a Gradio inference interface.
+
+🌟 v1.1.0 was released in 12/10/2023
+
+- Support Mini-GPT4 training and provide a Chinese model (based on Baichuan-7B)
+- Support zero-shot classification based on CLIP.
+
+🌟 v1.0.0 was released in 04/07/2023
+
+- Support inference of more **multi-modal** algorithms, such as [**LLaVA**](./configs/llava/), [**MiniGPT-4**](./configs/minigpt4), [**Otter**](./configs/otter/), etc.
+- Support around **10 multi-modal** datasets!
+- Add [**iTPN**](./configs/itpn/), [**SparK**](./configs/spark/) self-supervised learning algorithms.
+- Provide examples of [New Config](./mmpretrain/configs/) and [DeepSpeed/FSDP with FlexibleRunner](./configs/mae/benchmarks/). Here are the documentation links of [New Config](https://mmengine.readthedocs.io/en/latest/advanced_tutorials/config.html#a-pure-python-style-configuration-file-beta) and [DeepSpeed/FSDP with FlexibleRunner](https://mmengine.readthedocs.io/en/latest/api/generated/mmengine.runner.FlexibleRunner.html#mmengine.runner.FlexibleRunner).
+
+🌟 Upgrade from MMClassification to MMPreTrain
+
+- Integrated self-supervised learning algorithms from **MMSelfSup**, such as **MAE**, **BEiT**, etc.
+- Support **RIFormer**, a simple but effective vision backbone that removes the token mixer.
+- Refactor dataset pipeline visualization.
+- Support **LeViT**, **XCiT**, **ViG**, **ConvNeXt-V2**, **EVA**, **RevViT**, **EfficientNetV2**, **CLIP**, **TinyViT** and **MixMIM** backbones.
+
+This release introduced a brand new and flexible training & test engine, which is still a work in progress. You are welcome
+to try it by following [the documentation](https://mmpretrain.readthedocs.io/en/latest/).
+
+And there are some BC-breaking changes. Please check [the migration tutorial](https://mmpretrain.readthedocs.io/en/latest/migration.html).
-## 论文
+Please refer to [changelog](https://mmpretrain.readthedocs.io/en/latest/notes/changelog.html) for more details and other release history.
-MobileNetV2: Inverted Residuals and Linear Bottlenecks
+## Installation
-- https://openaccess.thecvf.com/content_cvpr_2018/papers/Sandler_MobileNetV2_Inverted_Residuals_CVPR_2018_paper.pdf
+Below are quick steps for installation:
-## 模型结构
-
-MobileNetV2是一种轻量级的卷积神经网络模型,由Google在2018年提出。它是MobileNet系列中的第二个版本,主要用于移动设备和嵌入式设备等资源受限的环境中进行图像分类、目标检测等计算机视觉任务。
-
-
-
-
-
-## 算法原理
-
-MobileNetV2的网络结构主要由两部分组成:特征提取层和分类器。
-
-
-
-## 环境配置
-
-### Docker(方法一)
-
-```python
-git clone --recursive http://developer.hpccube.com/codes/modelzoo/mobilenetv2_mmcv.git
-docker pull image.sourcefind.cn:5000/dcu/admin/base/pytorch:1.10.0-centos7.6-dtk-22.10.1-py37-latest
-# 用以上拉取的docker的镜像ID替换
-docker run --shm-size 10g --network=host --name=mobilenetv2 --privileged --device=/dev/kfd --device=/dev/dri --group-add video --cap-add=SYS_PTRACE --security-opt seccomp=unconfined -v $PWD/mobilenetv2_mmcv:/home/mobilenetv2_mmcv -it bash
-
-cd mobilenetv2_mmcv/mmclassification-mmcv
-pip install -r requirements.txt
+```shell
+conda create -n open-mmlab python=3.8 pytorch==1.10.1 torchvision==0.11.2 cudatoolkit=11.3 -c pytorch -y
+conda activate open-mmlab
+pip install openmim
+git clone https://github.com/open-mmlab/mmpretrain.git
+cd mmpretrain
+mim install -e .
```
-### Dockerfile(方法二)
-
-```plaintext
-cd mobilenetv2_mmcv/docker
-docker build --no-cache -t mobilenetv2_mmcv:latest .
-docker run --rm --shm-size 10g --network=host --name=mobilenetv2 --privileged --device=/dev/kfd --device=/dev/dri --group-add video --cap-add=SYS_PTRACE --security-opt seccomp=unconfined -v $PWD/../../mobilenetv2_mmcv:/home/mobilenetv2_mmcv -it bash
-# 若遇到Dockerfile启动的方式安装环境需要长时间等待,可注释掉里面的pip安装,启动容器后再安装python库:pip install -r requirements.txt
-```
+Please refer to [installation documentation](https://mmpretrain.readthedocs.io/en/latest/get_started.html) for more detailed installation and dataset preparation.
-### Anaconda(方法三)
+For multi-modality model support, please install the extra dependencies with:
-1、关于本项目DCU显卡所需的特殊深度学习库可从光合开发者社区下载安装: https://developer.hpccube.com/tool/
-
-```plaintext
-DTK驱动:dtk22.10.1
-python:python3.7
-torch:1.10.0
-torchvision:0.10.0
-mmcv:1.6.1
-Tips:以上dtk驱动、python、torch等DCU相关工具版本需要严格一一对应
+```shell
+mim install -e ".[multimodal]"
```
-2、其它非特殊库参照requirements.txt安装
-
-```plaintext
-pip install -r requirements.txt
-```
-
-## 数据集
-
-在本测试中可以使用ImageNet数据集。
-
-下载ImageNet数据集:https://image-net.org/
-
-下载val数据:链接:https://pan.baidu.com/s/1oXsmsYahGVG3uOZ8e535LA?pwd=c3bc 提取码:c3bc 替换ImageNet数据集中的val目录,处理后的数据结构如下:
-
-```
-data
- ├──imagenet
- ├── meta
- ├──val.txt
- ├──train.txt
- ...
- ├── train
- ├── val
-
+## User Guides
+
+We provide a series of tutorials on the basic usage of MMPreTrain for new users:
+
+- [Learn about Configs](https://mmpretrain.readthedocs.io/en/latest/user_guides/config.html)
+- [Prepare Dataset](https://mmpretrain.readthedocs.io/en/latest/user_guides/dataset_prepare.html)
+- [Inference with existing models](https://mmpretrain.readthedocs.io/en/latest/user_guides/inference.html)
+- [Train](https://mmpretrain.readthedocs.io/en/latest/user_guides/train.html)
+- [Test](https://mmpretrain.readthedocs.io/en/latest/user_guides/test.html)
+- [Downstream tasks](https://mmpretrain.readthedocs.io/en/latest/user_guides/downstream.html)
+
+For more information, please refer to [our documentation](https://mmpretrain.readthedocs.io/en/latest/).
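+
+As a quick reference, single-GPU training and testing with an existing config usually look like the following; the config file and checkpoint path below are examples:
+
+```shell
+# train a model with one of the configs in this repository
+python tools/train.py configs/resnet/resnet18_8xb32_in1k.py
+
+# evaluate a trained checkpoint with the same config
+python tools/test.py configs/resnet/resnet18_8xb32_in1k.py work_dirs/resnet18_8xb32_in1k/epoch_100.pth
+```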
+
+## Model zoo
+
+Results and models are available in the [model zoo](https://mmpretrain.readthedocs.io/en/latest/modelzoo_statistics.html).
+
+**Overview**: supported backbones, self-supervised learning algorithms, multi-modality algorithms, and others (image retrieval task, training & test tips).
+
+## Contributing
+
+We appreciate all contributions to improve MMPreTrain.
+Please refer to [CONTRIBUTING](https://mmpretrain.readthedocs.io/en/latest/notes/contribution_guide.html) for the contributing guideline.
+
+## Acknowledgement
+
+MMPreTrain is an open source project contributed by researchers and engineers from various colleges and companies. We appreciate all the contributors who implement their methods or add new features, as well as users who give valuable feedback.
+We wish that the toolbox and benchmark could serve the growing research community by providing a flexible toolkit to reimplement existing methods and develop their own new methods.
+
+## Citation
+
+If you find this project useful in your research, please consider citing:
+
+```BibTeX
+@misc{2023mmpretrain,
+ title={OpenMMLab's Pre-training Toolbox and Benchmark},
+ author={MMPreTrain Contributors},
+ howpublished = {\url{https://github.com/open-mmlab/mmpretrain}},
+ year={2023}
+}
```
-SCNet快速下载链接[http://113.200.138.88:18080/aidatasets/project-dependency/imagenet-2012
-](http://113.200.138.88:18080/aidatasets/project-dependency/imagenet-2012
-)
-## 训练
-
-将训练数据解压到data目录下。
-
-### 单机8卡
-
- ./mobilenetv2.sh
-
-## result
-
-
-
-### 精度
-
-测试数据使用的是ImageNet数据集,使用的加速卡是DCU Z100L。
-
-| 卡数 | 精度 |
-| :--: | :-----------------------: |
-| 8 | top1:0.71764;top5:0.90386 |
-
-## 应用场景
-
-### 算法类别
-
-图像分类
-
-### 热点行业
-
-制造,能源,交通,网安
-
-## 源码仓库及问题反馈
-
-https://developer.hpccube.com/codes/modelzoo/mobilenetv2_mmcv
-
-## 参考资料
-https://github.com/open-mmlab/mmpretrain
+## License
+
+This project is released under the [Apache 2.0 license](LICENSE).
+
+## Projects in OpenMMLab
+
+- [MMEngine](https://github.com/open-mmlab/mmengine): OpenMMLab foundational library for training deep learning models.
+- [MMCV](https://github.com/open-mmlab/mmcv): OpenMMLab foundational library for computer vision.
+- [MIM](https://github.com/open-mmlab/mim): MIM installs OpenMMLab packages.
+- [MMEval](https://github.com/open-mmlab/mmeval): A unified evaluation library for multiple machine learning libraries.
+- [MMPreTrain](https://github.com/open-mmlab/mmpretrain): OpenMMLab pre-training toolbox and benchmark.
+- [MMDetection](https://github.com/open-mmlab/mmdetection): OpenMMLab detection toolbox and benchmark.
+- [MMDetection3D](https://github.com/open-mmlab/mmdetection3d): OpenMMLab's next-generation platform for general 3D object detection.
+- [MMRotate](https://github.com/open-mmlab/mmrotate): OpenMMLab rotated object detection toolbox and benchmark.
+- [MMYOLO](https://github.com/open-mmlab/mmyolo): OpenMMLab YOLO series toolbox and benchmark.
+- [MMSegmentation](https://github.com/open-mmlab/mmsegmentation): OpenMMLab semantic segmentation toolbox and benchmark.
+- [MMOCR](https://github.com/open-mmlab/mmocr): OpenMMLab text detection, recognition, and understanding toolbox.
+- [MMPose](https://github.com/open-mmlab/mmpose): OpenMMLab pose estimation toolbox and benchmark.
+- [MMHuman3D](https://github.com/open-mmlab/mmhuman3d): OpenMMLab 3D human parametric model toolbox and benchmark.
+- [MMSelfSup](https://github.com/open-mmlab/mmselfsup): OpenMMLab self-supervised learning toolbox and benchmark.
+- [MMRazor](https://github.com/open-mmlab/mmrazor): OpenMMLab model compression toolbox and benchmark.
+- [MMFewShot](https://github.com/open-mmlab/mmfewshot): OpenMMLab fewshot learning toolbox and benchmark.
+- [MMAction2](https://github.com/open-mmlab/mmaction2): OpenMMLab's next-generation action understanding toolbox and benchmark.
+- [MMTracking](https://github.com/open-mmlab/mmtracking): OpenMMLab video perception toolbox and benchmark.
+- [MMFlow](https://github.com/open-mmlab/mmflow): OpenMMLab optical flow toolbox and benchmark.
+- [MMagic](https://github.com/open-mmlab/mmagic): Open**MM**Lab **A**dvanced, **G**enerative and **I**ntelligent **C**reation toolbox.
+- [MMGeneration](https://github.com/open-mmlab/mmgeneration): OpenMMLab image and video generative models toolbox.
+- [MMDeploy](https://github.com/open-mmlab/mmdeploy): OpenMMLab model deployment framework.
+- [Playground](https://github.com/open-mmlab/playground): A central hub for gathering and showcasing amazing projects built upon OpenMMLab.
diff --git a/README_zh-CN.md b/README_zh-CN.md
new file mode 100644
index 0000000000000000000000000000000000000000..9ee8dffc401d414c0c2b7135ba2a4887f80608a4
--- /dev/null
+++ b/README_zh-CN.md
@@ -0,0 +1,353 @@
+
+
+

+
+
+
+
+[PyPI](https://pypi.org/project/mmpretrain)
+[Docs](https://mmpretrain.readthedocs.io/zh_CN/latest/)
+[Build Status](https://github.com/open-mmlab/mmpretrain/actions)
+[Coverage](https://codecov.io/gh/open-mmlab/mmpretrain)
+[License](https://github.com/open-mmlab/mmpretrain/blob/main/LICENSE)
+[Issues](https://github.com/open-mmlab/mmpretrain/issues)
+
+[📘 中文文档](https://mmpretrain.readthedocs.io/zh_CN/latest/) |
+[🛠️ 安装教程](https://mmpretrain.readthedocs.io/zh_CN/latest/get_started.html) |
+[👀 模型库](https://mmpretrain.readthedocs.io/zh_CN/latest/modelzoo_statistics.html) |
+[🆕 更新日志](https://mmpretrain.readthedocs.io/zh_CN/latest/notes/changelog.html) |
+[🤔 报告问题](https://github.com/open-mmlab/mmpretrain/issues/new/choose)
+
+

+
+[English](/README.md) | 简体中文
+
+
+
+
+
+ 
+

+
+ 
+

+
+ 
+

+
+ 
+

+
+ 
+

+
+ 
+
+
+## Introduction
+
+MMPreTrain 是一款基于 PyTorch 的开源深度学习预训练工具箱,是 [OpenMMLab](https://openmmlab.com/) 项目的成员之一
+
+`主分支`代码目前支持 PyTorch 1.8 以上的版本。
+
+### 主要特性
+
+- 支持多样的主干网络与预训练模型
+- 支持多种训练策略(有监督学习,无监督学习,多模态学习等)
+- 提供多种训练技巧
+- 大量的训练配置文件
+- 高效率和高可扩展性
+- 功能强大的工具箱,有助于模型分析和实验
+- 支持多种开箱即用的推理任务
+ - 图像分类
+ - 图像描述(Image Caption)
+ - 视觉问答(Visual Question Answering)
+ - 视觉定位(Visual Grounding)
+ - 检索(图搜图,图搜文,文搜图)
+
+https://github.com/open-mmlab/mmpretrain/assets/26739999/e4dcd3a2-f895-4d1b-a351-fbc74a04e904
+
+## 更新日志
+
+🌟 2024/01/04 发布了 v1.2.0 版本
+
+- 支持了 LLaVA 1.5
+- 实现了一个 RAM 模型的 gradio 推理例程
+
+🌟 2023/10/12 发布了 v1.1.0 版本
+
+- 支持 Mini-GPT4 训练并提供一个基于 Baichuan-7B 的中文模型
+- 支持基于 CLIP 的零样本分类。
+
+🌟 2023/7/4 发布了 v1.0.0 版本
+
+- 支持更多**多模态**算法的推理, 例如 [**LLaVA**](./configs/llava/), [**MiniGPT-4**](./configs/minigpt4), [**Otter**](./configs/otter/) 等。
+- 支持约 **10 个多模态**数据集!
+- 添加自监督学习算法 [**iTPN**](./configs/itpn/), [**SparK**](./configs/spark/)。
+- 提供[新配置文件](./mmpretrain/configs/)和 [DeepSpeed/FSDP](./configs/mae/benchmarks/) 的样例。这是[新配置文件](https://mmengine.readthedocs.io/en/latest/advanced_tutorials/config.html#a-pure-python-style-configuration-file-beta) 和 [DeepSpeed/FSDP with FlexibleRunner](https://mmengine.readthedocs.io/en/latest/api/generated/mmengine.runner.FlexibleRunner.html#mmengine.runner.FlexibleRunner) 的文档链接。
+
+🌟 从 MMClassification 升级到 MMPreTrain
+
+- 整合来自 MMSelfSup 的自监督学习算法,例如 `MAE`, `BEiT` 等
+- 支持了 **RIFormer**,简单但有效的视觉主干网络,却移除了 token mixer
+- 重构数据管道可视化
+- 支持了 **LeViT**, **XCiT**, **ViG**, **ConvNeXt-V2**, **EVA**, **RevViT**, **EfficientnetV2**, **CLIP**, **TinyViT** 和 **MixMIM** 等骨干网络结构
+
+这个版本引入一个全新的,可扩展性强的训练和测试引擎,但目前仍在开发中。欢迎根据 [文档](https://mmpretrain.readthedocs.io/zh_CN/latest/) 进行试用。
+
+同时,新版本中存在一些与旧版本不兼容的修改。请查看 [迁移文档](https://mmpretrain.readthedocs.io/zh_CN/latest/migration.html) 来详细了解这些变动。
+
+发布历史和更新细节请参考 [更新日志](https://mmpretrain.readthedocs.io/zh_CN/latest/notes/changelog.html)。
+
+## 安装
+
+以下是安装的简要步骤:
+
+```shell
+conda create -n open-mmlab python=3.8 pytorch==1.10.1 torchvision==0.11.2 cudatoolkit=11.3 -c pytorch -y
+conda activate open-mmlab
+pip3 install openmim
+git clone https://github.com/open-mmlab/mmpretrain.git
+cd mmpretrain
+mim install -e .
+```
+
+更详细的步骤请参考 [安装指南](https://mmpretrain.readthedocs.io/zh_CN/latest/get_started.html) 进行安装。
+
+如果需要多模态模型,请使用如下方式安装额外的依赖:
+
+```shell
+mim install -e ".[multimodal]"
+```
+
+## 基础教程
+
+我们为新用户提供了一系列基础教程:
+
+- [学习配置文件](https://mmpretrain.readthedocs.io/zh_CN/latest/user_guides/config.html)
+- [准备数据集](https://mmpretrain.readthedocs.io/zh_CN/latest/user_guides/dataset_prepare.html)
+- [使用现有模型推理](https://mmpretrain.readthedocs.io/zh_CN/latest/user_guides/inference.html)
+- [训练](https://mmpretrain.readthedocs.io/zh_CN/latest/user_guides/train.html)
+- [测试](https://mmpretrain.readthedocs.io/zh_CN/latest/user_guides/test.html)
+- [下游任务](https://mmpretrain.readthedocs.io/zh_CN/latest/user_guides/downstream.html)
+
+关于更多的信息,请查阅我们的 [相关文档](https://mmpretrain.readthedocs.io/zh_CN/latest/)。
+
+## 模型库
+
+相关结果和模型可在 [模型库](https://mmpretrain.readthedocs.io/zh_CN/latest/modelzoo_statistics.html) 中获得。
+
+**概览**:支持的主干网络、自监督学习算法、多模态算法,以及其它(图像检索任务、训练和测试 Tips)。
+
+## 参与贡献
+
+我们非常欢迎任何有助于提升 MMPreTrain 的贡献,请参考 [贡献指南](https://mmpretrain.readthedocs.io/zh_CN/latest/notes/contribution_guide.html) 来了解如何参与贡献。
+
+## 致谢
+
+MMPreTrain 是一款由不同学校和公司共同贡献的开源项目。我们感谢所有为项目提供算法复现和新功能支持的贡献者,以及提供宝贵反馈的用户。
+我们希望该工具箱和基准测试可以为社区提供灵活的代码工具,供用户复现现有算法并开发自己的新模型,从而不断为开源社区提供贡献。
+
+## 引用
+
+如果你在研究中使用了本项目的代码或者性能基准,请参考如下 bibtex 引用 MMPreTrain。
+
+```BibTeX
+@misc{2023mmpretrain,
+ title={OpenMMLab's Pre-training Toolbox and Benchmark},
+ author={MMPreTrain Contributors},
+ howpublished = {\url{https://github.com/open-mmlab/mmpretrain}},
+ year={2023}
+}
+```
+
+## 许可证
+
+该项目开源自 [Apache 2.0 license](LICENSE).
+
+## OpenMMLab 的其他项目
+
+- [MMEngine](https://github.com/open-mmlab/mmengine): OpenMMLab 深度学习模型训练基础库
+- [MMCV](https://github.com/open-mmlab/mmcv): OpenMMLab 计算机视觉基础库
+- [MIM](https://github.com/open-mmlab/mim): MIM 是 OpenMMlab 项目、算法、模型的统一入口
+- [MMEval](https://github.com/open-mmlab/mmeval): 统一开放的跨框架算法评测库
+- [MMPreTrain](https://github.com/open-mmlab/mmpretrain): OpenMMLab 深度学习预训练工具箱
+- [MMDetection](https://github.com/open-mmlab/mmdetection): OpenMMLab 目标检测工具箱
+- [MMDetection3D](https://github.com/open-mmlab/mmdetection3d): OpenMMLab 新一代通用 3D 目标检测平台
+- [MMRotate](https://github.com/open-mmlab/mmrotate): OpenMMLab 旋转框检测工具箱与测试基准
+- [MMYOLO](https://github.com/open-mmlab/mmyolo): OpenMMLab YOLO 系列工具箱与测试基准
+- [MMSegmentation](https://github.com/open-mmlab/mmsegmentation): OpenMMLab 语义分割工具箱
+- [MMOCR](https://github.com/open-mmlab/mmocr): OpenMMLab 全流程文字检测识别理解工具包
+- [MMPose](https://github.com/open-mmlab/mmpose): OpenMMLab 姿态估计工具箱
+- [MMHuman3D](https://github.com/open-mmlab/mmhuman3d): OpenMMLab 人体参数化模型工具箱与测试基准
+- [MMSelfSup](https://github.com/open-mmlab/mmselfsup): OpenMMLab 自监督学习工具箱与测试基准
+- [MMRazor](https://github.com/open-mmlab/mmrazor): OpenMMLab 模型压缩工具箱与测试基准
+- [MMFewShot](https://github.com/open-mmlab/mmfewshot): OpenMMLab 少样本学习工具箱与测试基准
+- [MMAction2](https://github.com/open-mmlab/mmaction2): OpenMMLab 新一代视频理解工具箱
+- [MMTracking](https://github.com/open-mmlab/mmtracking): OpenMMLab 一体化视频目标感知平台
+- [MMFlow](https://github.com/open-mmlab/mmflow): OpenMMLab 光流估计工具箱与测试基准
+- [MMagic](https://github.com/open-mmlab/mmagic): OpenMMLab 新一代人工智能内容生成(AIGC)工具箱
+- [MMGeneration](https://github.com/open-mmlab/mmgeneration): OpenMMLab 图片视频生成模型工具箱
+- [MMDeploy](https://github.com/open-mmlab/mmdeploy): OpenMMLab 模型部署框架
+- [Playground](https://github.com/open-mmlab/playground): 收集和展示 OpenMMLab 相关的前沿、有趣的社区项目
+
+## 欢迎加入 OpenMMLab 社区
+
+扫描下方的二维码可关注 OpenMMLab 团队的 [知乎官方账号](https://www.zhihu.com/people/openmmlab),扫描下方微信二维码添加喵喵好友,进入 MMPretrain 微信交流社群。【加好友申请格式:研究方向+地区+学校/公司+姓名】
+
+
+

+
+
+我们会在 OpenMMLab 社区为大家
+
+- 📢 分享 AI 框架的前沿核心技术
+- 💻 解读 PyTorch 常用模块源码
+- 📰 发布 OpenMMLab 的相关新闻
+- 🚀 介绍 OpenMMLab 开发的前沿算法
+- 🏃 获取更高效的问题答疑和意见反馈
+- 🔥 提供与各行各业开发者充分交流的平台
+
+干货满满 📘,等你来撩 💗,OpenMMLab 社区期待您的加入 👬
diff --git a/configs/_base_/datasets/cifar100_bs16.py b/configs/_base_/datasets/cifar100_bs16.py
new file mode 100644
index 0000000000000000000000000000000000000000..67477db0367fa1356c4514a46f4b43d56b4c5822
--- /dev/null
+++ b/configs/_base_/datasets/cifar100_bs16.py
@@ -0,0 +1,45 @@
+# dataset settings
+dataset_type = 'CIFAR100'
+data_preprocessor = dict(
+ num_classes=100,
+ # RGB format normalization parameters
+ mean=[129.304, 124.070, 112.434],
+ std=[68.170, 65.392, 70.418],
+ # loaded images are already RGB format
+ to_rgb=False)
+
+train_pipeline = [
+ dict(type='RandomCrop', crop_size=32, padding=4),
+ dict(type='RandomFlip', prob=0.5, direction='horizontal'),
+ dict(type='PackInputs'),
+]
+
+test_pipeline = [
+ dict(type='PackInputs'),
+]
+
+train_dataloader = dict(
+ batch_size=16,
+ num_workers=2,
+ dataset=dict(
+ type=dataset_type,
+ data_root='data/cifar100',
+ split='train',
+ pipeline=train_pipeline),
+ sampler=dict(type='DefaultSampler', shuffle=True),
+)
+
+val_dataloader = dict(
+ batch_size=16,
+ num_workers=2,
+ dataset=dict(
+ type=dataset_type,
+ data_root='data/cifar100/',
+ split='test',
+ pipeline=test_pipeline),
+ sampler=dict(type='DefaultSampler', shuffle=False),
+)
+val_evaluator = dict(type='Accuracy', topk=(1, ))
+
+test_dataloader = val_dataloader
+test_evaluator = val_evaluator
diff --git a/configs/_base_/datasets/cifar10_bs16.py b/configs/_base_/datasets/cifar10_bs16.py
new file mode 100644
index 0000000000000000000000000000000000000000..408be35da845a39bf7058eb9c3ce5549295b3822
--- /dev/null
+++ b/configs/_base_/datasets/cifar10_bs16.py
@@ -0,0 +1,45 @@
+# dataset settings
+dataset_type = 'CIFAR10'
+data_preprocessor = dict(
+ num_classes=10,
+ # RGB format normalization parameters
+ mean=[125.307, 122.961, 113.8575],
+ std=[51.5865, 50.847, 51.255],
+ # loaded images are already RGB format
+ to_rgb=False)
+
+train_pipeline = [
+ dict(type='RandomCrop', crop_size=32, padding=4),
+ dict(type='RandomFlip', prob=0.5, direction='horizontal'),
+ dict(type='PackInputs'),
+]
+
+test_pipeline = [
+ dict(type='PackInputs'),
+]
+
+train_dataloader = dict(
+ batch_size=16,
+ num_workers=2,
+ dataset=dict(
+ type=dataset_type,
+ data_root='data/cifar10',
+ split='train',
+ pipeline=train_pipeline),
+ sampler=dict(type='DefaultSampler', shuffle=True),
+)
+
+val_dataloader = dict(
+ batch_size=16,
+ num_workers=2,
+ dataset=dict(
+ type=dataset_type,
+ data_root='data/cifar10/',
+ split='test',
+ pipeline=test_pipeline),
+ sampler=dict(type='DefaultSampler', shuffle=False),
+)
+val_evaluator = dict(type='Accuracy', topk=(1, ))
+
+test_dataloader = val_dataloader
+test_evaluator = val_evaluator
diff --git a/configs/_base_/datasets/coco_caption.py b/configs/_base_/datasets/coco_caption.py
new file mode 100644
index 0000000000000000000000000000000000000000..5346111273d4120581fe854583c99f6b94e7e873
--- /dev/null
+++ b/configs/_base_/datasets/coco_caption.py
@@ -0,0 +1,70 @@
+# data settings
+# coco caption annotations can be grabbed from LAVIS repo
+# https://github.com/salesforce/LAVIS/blob/main/lavis/configs/datasets/coco/defaults_cap.yaml
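+# A possible layout after downloading, following the data_root/ann_file settings
+# below (paths are an assumption, adjust them to your setup):
+#   data/coco/annotations/coco_karpathy_train.json
+#   data/coco/annotations/coco_karpathy_val.json
+#   data/coco/annotations/coco_karpathy_val_gt.json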
+data_preprocessor = dict(
+ type='MultiModalDataPreprocessor',
+ mean=[122.770938, 116.7460125, 104.09373615],
+ std=[68.5005327, 66.6321579, 70.32316305],
+ to_rgb=True,
+)
+
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='RandomResizedCrop',
+ scale=384,
+ interpolation='bicubic',
+ backend='pillow'),
+ dict(type='RandomFlip', prob=0.5, direction='horizontal'),
+ dict(type='CleanCaption', keys='gt_caption'),
+ dict(
+ type='PackInputs',
+ algorithm_keys=['gt_caption'],
+ meta_keys=['image_id'],
+ ),
+]
+
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='Resize',
+ scale=(384, 384),
+ interpolation='bicubic',
+ backend='pillow'),
+ dict(type='PackInputs', meta_keys=['image_id']),
+]
+
+train_dataloader = dict(
+ batch_size=32,
+ num_workers=5,
+ dataset=dict(
+ type='COCOCaption',
+ data_root='data/coco',
+ ann_file='annotations/coco_karpathy_train.json',
+ pipeline=train_pipeline),
+ sampler=dict(type='DefaultSampler', shuffle=True),
+ persistent_workers=True,
+ drop_last=True,
+)
+
+val_dataloader = dict(
+ batch_size=16,
+ num_workers=5,
+ dataset=dict(
+ type='COCOCaption',
+ data_root='data/coco',
+ ann_file='annotations/coco_karpathy_val.json',
+ pipeline=test_pipeline,
+ ),
+ sampler=dict(type='DefaultSampler', shuffle=False),
+ persistent_workers=True,
+)
+
+val_evaluator = dict(
+ type='COCOCaption',
+ ann_file='data/coco/annotations/coco_karpathy_val_gt.json',
+)
+
+# If you want standard test, please manually configure the test dataset
+test_dataloader = val_dataloader
+test_evaluator = val_evaluator
diff --git a/configs/_base_/datasets/coco_okvqa.py b/configs/_base_/datasets/coco_okvqa.py
new file mode 100644
index 0000000000000000000000000000000000000000..16f1577dbb5e5c7c14186f2523e94e0aeffc4b54
--- /dev/null
+++ b/configs/_base_/datasets/coco_okvqa.py
@@ -0,0 +1,75 @@
+# data settings
+
+data_preprocessor = dict(
+ mean=[122.770938, 116.7460125, 104.09373615],
+ std=[68.5005327, 66.6321579, 70.32316305],
+ to_rgb=True,
+)
+
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='RandomResizedCrop',
+ scale=384,
+ interpolation='bicubic',
+ backend='pillow'),
+ dict(
+ type='PackInputs',
+ algorithm_keys=['question', 'gt_answer', 'gt_answer_weight'],
+ meta_keys=['question_id', 'image_id'],
+ ),
+]
+
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='Resize',
+ scale=(480, 480),
+ interpolation='bicubic',
+ backend='pillow'),
+ dict(
+ type='CleanCaption',
+ keys=['question'],
+ ),
+ dict(
+ type='PackInputs',
+ algorithm_keys=['question', 'gt_answer', 'gt_answer_weight'],
+ meta_keys=['question_id', 'image_id'],
+ ),
+]
+
+train_dataloader = dict(
+ batch_size=16,
+ num_workers=8,
+ dataset=dict(
+ type='COCOVQA',
+ data_root='data/coco',
+ data_prefix='train2014',
+ question_file=
+ 'annotations/okvqa_OpenEnded_mscoco_train2014_questions.json',
+ ann_file='annotations/okvqa_mscoco_train2014_annotations.json',
+ pipeline=train_pipeline),
+ sampler=dict(type='DefaultSampler', shuffle=True),
+ persistent_workers=True,
+ drop_last=True,
+)
+
+val_dataloader = dict(
+ batch_size=16,
+ num_workers=8,
+ dataset=dict(
+ type='COCOVQA',
+ data_root='data/coco',
+ data_prefix='val2014',
+ question_file=
+ 'annotations/okvqa_OpenEnded_mscoco_val2014_questions.json',
+ ann_file='annotations/okvqa_mscoco_val2014_annotations.json',
+ pipeline=test_pipeline),
+ sampler=dict(type='DefaultSampler', shuffle=False),
+ persistent_workers=True,
+)
+
+val_evaluator = dict(type='VQAAcc')
+
+test_dataloader = val_dataloader
+test_evaluator = val_evaluator
diff --git a/configs/_base_/datasets/coco_retrieval.py b/configs/_base_/datasets/coco_retrieval.py
new file mode 100644
index 0000000000000000000000000000000000000000..6f6b802a3854fd029c476d78296edbc9bffd4e75
--- /dev/null
+++ b/configs/_base_/datasets/coco_retrieval.py
@@ -0,0 +1,99 @@
+# data settings
+# Here are the links to download the annotations of COCO retrieval for convenience  # noqa
+# https://download.openmmlab.com/mmclassification/datasets/coco_retrieval/caption_karpathy_train2014.json
+# https://download.openmmlab.com/mmclassification/datasets/coco_retrieval/caption_karpathy_val2014.json
+# https://download.openmmlab.com/mmclassification/datasets/coco_retrieval/caption_karpathy_test2014.json
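+# With data_root='data/coco' as configured below, the downloaded annotations are
+# expected at (an assumption based on the ann_file settings in this config):
+#   data/coco/annotations/caption_karpathy_train2014.json
+#   data/coco/annotations/caption_karpathy_val2014.json
+#   data/coco/annotations/caption_karpathy_test2014.json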
+data_preprocessor = dict(
+ type='MultiModalDataPreprocessor',
+ mean=[122.770938, 116.7460125, 104.09373615],
+ std=[68.5005327, 66.6321579, 70.32316305],
+ to_rgb=True,
+)
+
+rand_increasing_policies = [
+ dict(type='AutoContrast'),
+ dict(type='Equalize'),
+ dict(type='Rotate', magnitude_key='angle', magnitude_range=(0, 30)),
+ dict(
+ type='Brightness', magnitude_key='magnitude',
+ magnitude_range=(0, 0.0)),
+ dict(type='Sharpness', magnitude_key='magnitude', magnitude_range=(0, 0)),
+ dict(
+ type='Shear',
+ magnitude_key='magnitude',
+ magnitude_range=(0, 0.3),
+ direction='horizontal'),
+ dict(
+ type='Shear',
+ magnitude_key='magnitude',
+ magnitude_range=(0, 0.3),
+ direction='vertical'),
+]
+
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='RandomResizedCrop',
+ scale=384,
+ crop_ratio_range=(0.5, 1.0),
+ interpolation='bicubic'),
+ dict(type='RandomFlip', prob=0.5, direction='horizontal'),
+ dict(
+ type='RandAugment',
+ policies=rand_increasing_policies,
+ num_policies=2,
+ magnitude_level=5),
+ dict(type='CleanCaption', keys='text'),
+ dict(
+ type='PackInputs',
+ algorithm_keys=['text', 'is_matched'],
+ meta_keys=['image_id']),
+]
+
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='Resize',
+ scale=(384, 384),
+ interpolation='bicubic',
+ backend='pillow'),
+ dict(type='CleanCaption', keys='text'),
+ dict(
+ type='PackInputs',
+ algorithm_keys=['text', 'gt_text_id', 'gt_image_id'],
+ meta_keys=['image_id']),
+]
+
+train_dataloader = dict(
+ batch_size=32,
+ num_workers=16,
+ dataset=dict(
+ type='COCORetrieval',
+ data_root='data/coco',
+ ann_file='annotations/caption_karpathy_train2014.json',
+ pipeline=train_pipeline),
+ sampler=dict(type='DefaultSampler', shuffle=True),
+ persistent_workers=True,
+ drop_last=True,
+)
+
+val_dataloader = dict(
+ batch_size=64,
+ num_workers=16,
+ dataset=dict(
+ type='COCORetrieval',
+ data_root='data/coco',
+ ann_file='annotations/caption_karpathy_val2014.json',
+ pipeline=test_pipeline,
+ # This is required for evaluation
+ test_mode=True,
+ ),
+ sampler=dict(type='SequentialSampler', subsample_type='sequential'),
+ persistent_workers=True,
+)
+
+val_evaluator = dict(type='RetrievalRecall', topk=(1, 5, 10))
+
+# If you want standard test, please manually configure the test dataset
+test_dataloader = val_dataloader
+test_evaluator = val_evaluator
diff --git a/configs/_base_/datasets/coco_vg_vqa.py b/configs/_base_/datasets/coco_vg_vqa.py
new file mode 100644
index 0000000000000000000000000000000000000000..7ba0eac46853c1a477e2c6b2bc3dcddbbf7e5423
--- /dev/null
+++ b/configs/_base_/datasets/coco_vg_vqa.py
@@ -0,0 +1,96 @@
+# data settings
+data_preprocessor = dict(
+ type='MultiModalDataPreprocessor',
+ mean=[122.770938, 116.7460125, 104.09373615],
+ std=[68.5005327, 66.6321579, 70.32316305],
+ to_rgb=True,
+)
+
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='RandomResizedCrop',
+ scale=(480, 480),
+ crop_ratio_range=(0.5, 1.0),
+ interpolation='bicubic',
+ backend='pillow'),
+ dict(type='RandomFlip', prob=0.5, direction='horizontal'),
+ dict(
+ type='RandAugment',
+ policies='simple_increasing', # slightly different from LAVIS
+ num_policies=2,
+ magnitude_level=5),
+ dict(type='CleanCaption', keys=['question', 'gt_answer']),
+ dict(
+ type='PackInputs',
+ algorithm_keys=['question', 'gt_answer', 'gt_answer_weight']),
+]
+
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='Resize',
+ scale=(480, 480),
+ interpolation='bicubic',
+ backend='pillow'),
+ dict(type='CleanCaption', keys=['question']),
+ dict(
+ type='PackInputs',
+ algorithm_keys=['question'],
+ meta_keys=['question_id']),
+]
+
+train_dataloader = dict(
+ batch_size=32,
+ num_workers=8,
+ dataset=dict(
+ type='ConcatDataset',
+ datasets=[
+ # VQAv2 train
+ dict(
+ type='COCOVQA',
+ data_root='data/coco',
+ data_prefix='train2014',
+ question_file=
+ 'annotations/v2_OpenEnded_mscoco_train2014_questions.json',
+ ann_file='annotations/v2_mscoco_train2014_annotations.json',
+ pipeline=train_pipeline,
+ ),
+ # VQAv2 val
+ dict(
+ type='COCOVQA',
+ data_root='data/coco',
+ data_prefix='val2014',
+ question_file=
+ 'annotations/v2_OpenEnded_mscoco_val2014_questions.json',
+ ann_file='annotations/v2_mscoco_val2014_annotations.json',
+ pipeline=train_pipeline,
+ ),
+ # Visual Genome
+ dict(
+ type='VisualGenomeQA',
+ data_root='visual_genome',
+ data_prefix='image',
+ ann_file='question_answers.json',
+ pipeline=train_pipeline,
+ )
+ ]),
+ sampler=dict(type='DefaultSampler', shuffle=True),
+ persistent_workers=True,
+ drop_last=True,
+)
+
+test_dataloader = dict(
+ batch_size=32,
+ num_workers=8,
+ dataset=dict(
+ type='COCOVQA',
+ data_root='data/coco',
+ data_prefix='test2015',
+ question_file=
+ 'annotations/v2_OpenEnded_mscoco_test2015_questions.json', # noqa: E501
+ pipeline=test_pipeline,
+ ),
+ sampler=dict(type='DefaultSampler', shuffle=False),
+)
+test_evaluator = dict(type='ReportVQA', file_path='vqa_test.json')
diff --git a/configs/_base_/datasets/coco_vqa.py b/configs/_base_/datasets/coco_vqa.py
new file mode 100644
index 0000000000000000000000000000000000000000..7fb16bd241b357a897b168ceff5450b6e7f2dc80
--- /dev/null
+++ b/configs/_base_/datasets/coco_vqa.py
@@ -0,0 +1,84 @@
+# data settings
+
+data_preprocessor = dict(
+ mean=[122.770938, 116.7460125, 104.09373615],
+ std=[68.5005327, 66.6321579, 70.32316305],
+ to_rgb=True,
+)
+
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='RandomResizedCrop',
+ scale=384,
+ interpolation='bicubic',
+ backend='pillow'),
+ dict(
+ type='PackInputs',
+ algorithm_keys=['question', 'gt_answer', 'gt_answer_weight'],
+ meta_keys=['question_id', 'image_id'],
+ ),
+]
+
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='Resize',
+ scale=(480, 480),
+ interpolation='bicubic',
+ backend='pillow'),
+ dict(
+ type='CleanCaption',
+ keys=['question'],
+ ),
+ dict(
+ type='PackInputs',
+ algorithm_keys=['question', 'gt_answer', 'gt_answer_weight'],
+ meta_keys=['question_id', 'image_id'],
+ ),
+]
+
+train_dataloader = dict(
+ batch_size=16,
+ num_workers=8,
+ dataset=dict(
+ type='COCOVQA',
+ data_root='data/coco',
+ data_prefix='train2014',
+ question_file=
+ 'annotations/v2_OpenEnded_mscoco_train2014_questions.json',
+ ann_file='annotations/v2_mscoco_train2014_annotations.json',
+ pipeline=train_pipeline),
+ sampler=dict(type='DefaultSampler', shuffle=True),
+ persistent_workers=True,
+ drop_last=True,
+)
+
+val_dataloader = dict(
+ batch_size=16,
+ num_workers=8,
+ dataset=dict(
+ type='COCOVQA',
+ data_root='data/coco',
+ data_prefix='val2014',
+ question_file='annotations/v2_OpenEnded_mscoco_val2014_questions.json',
+ ann_file='annotations/v2_mscoco_val2014_annotations.json',
+ pipeline=test_pipeline),
+ sampler=dict(type='DefaultSampler', shuffle=False),
+ persistent_workers=True,
+)
+val_evaluator = dict(type='VQAAcc')
+
+test_dataloader = dict(
+ batch_size=16,
+ num_workers=8,
+ dataset=dict(
+ type='COCOVQA',
+ data_root='data/coco',
+ data_prefix='test2015',
+ question_file= # noqa: E251
+ 'annotations/v2_OpenEnded_mscoco_test2015_questions.json',
+ pipeline=test_pipeline),
+ sampler=dict(type='DefaultSampler', shuffle=False),
+)
+test_evaluator = dict(type='ReportVQA', file_path='vqa_test.json')
diff --git a/configs/_base_/datasets/cub_bs8_384.py b/configs/_base_/datasets/cub_bs8_384.py
new file mode 100644
index 0000000000000000000000000000000000000000..24b3a9ffd4df6987716f15a42cc2e3d02c436b90
--- /dev/null
+++ b/configs/_base_/datasets/cub_bs8_384.py
@@ -0,0 +1,51 @@
+# dataset settings
+dataset_type = 'CUB'
+data_preprocessor = dict(
+ num_classes=200,
+ # RGB format normalization parameters
+ mean=[123.675, 116.28, 103.53],
+ std=[58.395, 57.12, 57.375],
+ # convert image from BGR to RGB
+ to_rgb=True,
+)
+
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='Resize', scale=510),
+ dict(type='RandomCrop', crop_size=384),
+ dict(type='RandomFlip', prob=0.5, direction='horizontal'),
+ dict(type='PackInputs'),
+]
+
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='Resize', scale=510),
+ dict(type='CenterCrop', crop_size=384),
+ dict(type='PackInputs'),
+]
+
+train_dataloader = dict(
+ batch_size=8,
+ num_workers=2,
+ dataset=dict(
+ type=dataset_type,
+ data_root='data/CUB_200_2011',
+ split='train',
+ pipeline=train_pipeline),
+ sampler=dict(type='DefaultSampler', shuffle=True),
+)
+
+val_dataloader = dict(
+ batch_size=8,
+ num_workers=2,
+ dataset=dict(
+ type=dataset_type,
+ data_root='data/CUB_200_2011',
+ split='test',
+ pipeline=test_pipeline),
+ sampler=dict(type='DefaultSampler', shuffle=False),
+)
+val_evaluator = dict(type='Accuracy', topk=(1, ))
+
+test_dataloader = val_dataloader
+test_evaluator = val_evaluator
diff --git a/configs/_base_/datasets/cub_bs8_448.py b/configs/_base_/datasets/cub_bs8_448.py
new file mode 100644
index 0000000000000000000000000000000000000000..c0bc7b7e1fbd308763c68e1b6302669c705e8f41
--- /dev/null
+++ b/configs/_base_/datasets/cub_bs8_448.py
@@ -0,0 +1,50 @@
+# dataset settings
+dataset_type = 'CUB'
+data_preprocessor = dict(
+ num_classes=200,
+ mean=[123.675, 116.28, 103.53],
+ std=[58.395, 57.12, 57.375],
+ # convert image from BGR to RGB
+ to_rgb=True,
+)
+
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='Resize', scale=600),
+ dict(type='RandomCrop', crop_size=448),
+ dict(type='RandomFlip', prob=0.5, direction='horizontal'),
+ dict(type='PackInputs'),
+]
+
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='Resize', scale=600),
+ dict(type='CenterCrop', crop_size=448),
+ dict(type='PackInputs'),
+]
+
+train_dataloader = dict(
+ batch_size=8,
+ num_workers=2,
+ dataset=dict(
+ type=dataset_type,
+ data_root='data/CUB_200_2011',
+ split='train',
+ pipeline=train_pipeline),
+ sampler=dict(type='DefaultSampler', shuffle=True),
+)
+
+val_dataloader = dict(
+ batch_size=8,
+ num_workers=2,
+ dataset=dict(
+ type=dataset_type,
+ data_root='data/CUB_200_2011',
+ split='test',
+ pipeline=test_pipeline),
+ sampler=dict(type='DefaultSampler', shuffle=False),
+)
+val_evaluator = dict(type='Accuracy', topk=(1, ))
+
+test_dataloader = val_dataloader
+test_evaluator = val_evaluator
diff --git a/configs/_base_/datasets/flickr30k_caption.py b/configs/_base_/datasets/flickr30k_caption.py
new file mode 100644
index 0000000000000000000000000000000000000000..a902b5291f1df0df719f570538385a1c75dfccfd
--- /dev/null
+++ b/configs/_base_/datasets/flickr30k_caption.py
@@ -0,0 +1,92 @@
+# data settings
+
+data_preprocessor = dict(
+ type='MultiModalDataPreprocessor',
+ mean=[122.770938, 116.7460125, 104.09373615],
+ std=[68.5005327, 66.6321579, 70.32316305],
+ to_rgb=True,
+)
+
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='RandomResizedCrop',
+ scale=384,
+ interpolation='bicubic',
+ backend='pillow'),
+ dict(type='RandomFlip', prob=0.5, direction='horizontal'),
+ dict(type='CleanCaption', keys='gt_caption'),
+ dict(
+ type='PackInputs',
+ algorithm_keys=['gt_caption'],
+ meta_keys=['image_id'],
+ ),
+]
+
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='Resize',
+ scale=(384, 384),
+ interpolation='bicubic',
+ backend='pillow'),
+ dict(type='PackInputs', meta_keys=['image_id']),
+]
+
+train_dataloader = dict(
+ batch_size=32,
+ num_workers=5,
+ dataset=dict(
+ type='Flickr30kCaption',
+ data_root='data/flickr30k',
+ ann_file='annotations/dataset_flickr30k.json',
+ data_prefix='images',
+ split='train',
+ pipeline=train_pipeline),
+ sampler=dict(type='DefaultSampler', shuffle=True),
+ persistent_workers=True,
+ drop_last=True,
+)
+
+val_dataloader = dict(
+ batch_size=16,
+ num_workers=5,
+ dataset=dict(
+ type='Flickr30kCaption',
+ data_root='data/flickr30k',
+ ann_file='annotations/dataset_flickr30k.json',
+ data_prefix='images',
+ split='val',
+ pipeline=test_pipeline,
+ ),
+ sampler=dict(type='DefaultSampler', shuffle=False),
+ persistent_workers=True,
+)
+
+# refer to tools/dataset_converters/convert_flickr30k_ann.py
+val_evaluator = dict(
+ type='COCOCaption',
+ ann_file='data/flickr30k_val_gt.json',
+)
+
+# If you want standard test, please manually configure the test dataset
+test_dataloader = dict(
+ batch_size=16,
+ num_workers=5,
+ dataset=dict(
+ type='Flickr30kCaption',
+ data_root='data/flickr30k',
+ ann_file='annotations/dataset_flickr30k.json',
+ data_prefix='images',
+ split='test',
+ pipeline=test_pipeline,
+ ),
+ sampler=dict(type='DefaultSampler', shuffle=False),
+ persistent_workers=True,
+)
+
+# refer to tools/dataset_converters/convert_flickr30k_ann.py
+test_evaluator = dict(
+ type='COCOCaption',
+ ann_file='data/flickr30k_test_gt.json',
+)
diff --git a/configs/_base_/datasets/flickr30k_retrieval.py b/configs/_base_/datasets/flickr30k_retrieval.py
new file mode 100644
index 0000000000000000000000000000000000000000..acbc645b92214599d77cd9f3ecc70e9b7235b8e5
--- /dev/null
+++ b/configs/_base_/datasets/flickr30k_retrieval.py
@@ -0,0 +1,112 @@
+# data settings
+data_preprocessor = dict(
+ type='MultiModalDataPreprocessor',
+ mean=[122.770938, 116.7460125, 104.09373615],
+ std=[68.5005327, 66.6321579, 70.32316305],
+ to_rgb=True,
+)
+
+rand_increasing_policies = [
+ dict(type='AutoContrast'),
+ dict(type='Equalize'),
+ dict(type='Rotate', magnitude_key='angle', magnitude_range=(0, 30)),
+ dict(
+ type='Brightness', magnitude_key='magnitude',
+ magnitude_range=(0, 0.0)),
+ dict(type='Sharpness', magnitude_key='magnitude', magnitude_range=(0, 0)),
+ dict(
+ type='Shear',
+ magnitude_key='magnitude',
+ magnitude_range=(0, 0.3),
+ direction='horizontal'),
+ dict(
+ type='Shear',
+ magnitude_key='magnitude',
+ magnitude_range=(0, 0.3),
+ direction='vertical'),
+]
+
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='RandomResizedCrop',
+ scale=384,
+ crop_ratio_range=(0.5, 1.0),
+ interpolation='bicubic'),
+ dict(type='RandomFlip', prob=0.5, direction='horizontal'),
+ dict(
+ type='RandAugment',
+ policies=rand_increasing_policies,
+ num_policies=2,
+ magnitude_level=5),
+ dict(type='CleanCaption', keys='text'),
+ dict(
+ type='PackInputs',
+ algorithm_keys=['text', 'is_matched'],
+ meta_keys=['image_id']),
+]
+
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='Resize',
+ scale=(384, 384),
+ interpolation='bicubic',
+ backend='pillow'),
+ dict(type='CleanCaption', keys='text'),
+ dict(
+ type='PackInputs',
+ algorithm_keys=['text', 'gt_text_id', 'gt_image_id'],
+ meta_keys=['image_id']),
+]
+
+train_dataloader = dict(
+ batch_size=32,
+ num_workers=16,
+ dataset=dict(
+ type='Flickr30kRetrieval',
+ data_root='data/flickr30k',
+ ann_file='annotations/dataset_flickr30k.json',
+ data_prefix='images',
+ split='train',
+ pipeline=train_pipeline),
+ sampler=dict(type='DefaultSampler', shuffle=True),
+ persistent_workers=True,
+ drop_last=True,
+)
+
+val_dataloader = dict(
+ batch_size=64,
+ num_workers=16,
+ dataset=dict(
+ type='Flickr30kRetrieval',
+ data_root='data/flickr30k',
+ ann_file='annotations/dataset_flickr30k.json',
+ data_prefix='images',
+ split='val',
+ pipeline=test_pipeline,
+ test_mode=True, # This is required for evaluation
+ ),
+ sampler=dict(type='SequentialSampler', subsample_type='sequential'),
+ persistent_workers=True,
+)
+
+val_evaluator = dict(type='RetrievalRecall', topk=(1, 5, 10))
+
+# If you want standard test, please manually configure the test dataset
+test_dataloader = dict(
+ batch_size=64,
+ num_workers=16,
+ dataset=dict(
+ type='Flickr30kRetrieval',
+ data_root='data/flickr30k',
+ ann_file='annotations/dataset_flickr30k.json',
+ data_prefix='images',
+ split='test',
+ pipeline=test_pipeline,
+ test_mode=True, # This is required for evaluation
+ ),
+ sampler=dict(type='SequentialSampler', subsample_type='sequential'),
+ persistent_workers=True,
+)
+test_evaluator = val_evaluator
diff --git a/configs/_base_/datasets/gqa.py b/configs/_base_/datasets/gqa.py
new file mode 100644
index 0000000000000000000000000000000000000000..872ab451f32dd9cff87890c943a5ed1dc7ecb517
--- /dev/null
+++ b/configs/_base_/datasets/gqa.py
@@ -0,0 +1,81 @@
+# data settings
+
+data_preprocessor = dict(
+ mean=[122.770938, 116.7460125, 104.09373615],
+ std=[68.5005327, 66.6321579, 70.32316305],
+ to_rgb=True,
+)
+
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='RandomResizedCrop',
+ scale=384,
+ interpolation='bicubic',
+ backend='pillow'),
+ dict(
+ type='PackInputs',
+ algorithm_keys=['question', 'gt_answer', 'gt_answer_weight'],
+ meta_keys=['question_id', 'image_id'],
+ ),
+]
+
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='Resize',
+ scale=(480, 480),
+ interpolation='bicubic',
+ backend='pillow'),
+ dict(
+ type='CleanCaption',
+ keys=['question'],
+ ),
+ dict(
+ type='PackInputs',
+ algorithm_keys=['question', 'gt_answer', 'gt_answer_weight'],
+ meta_keys=['question_id', 'image_id'],
+ ),
+]
+
+train_dataloader = dict(
+ batch_size=16,
+ num_workers=8,
+ dataset=dict(
+ type='GQA',
+ data_root='data/gqa',
+ data_prefix='images',
+ ann_file='annotations/train_balanced_questions.json',
+        pipeline=train_pipeline),
+    sampler=dict(type='DefaultSampler', shuffle=True),
+ persistent_workers=True,
+ drop_last=True,
+)
+
+val_dataloader = dict(
+ batch_size=16,
+ num_workers=8,
+ dataset=dict(
+ type='GQA',
+ data_root='data/gqa',
+ data_prefix='images',
+ ann_file='annotations/testdev_balanced_questions.json',
+ pipeline=test_pipeline),
+ sampler=dict(type='DefaultSampler', shuffle=False),
+ persistent_workers=True,
+)
+val_evaluator = dict(type='GQAAcc')
+
+test_dataloader = dict(
+ batch_size=16,
+ num_workers=8,
+ dataset=dict(
+ type='GQA',
+ data_root='data/gqa',
+ data_prefix='images',
+ ann_file='annotations/testdev_balanced_questions.json',
+ pipeline=test_pipeline),
+ sampler=dict(type='DefaultSampler', shuffle=False),
+ persistent_workers=True,
+)
+test_evaluator = val_evaluator
diff --git a/configs/_base_/datasets/imagenet21k_bs128.py b/configs/_base_/datasets/imagenet21k_bs128.py
new file mode 100644
index 0000000000000000000000000000000000000000..38bfd351bf8f49ae18d21492c6fc656a7b2ecc45
--- /dev/null
+++ b/configs/_base_/datasets/imagenet21k_bs128.py
@@ -0,0 +1,28 @@
+# dataset settings
+dataset_type = 'ImageNet21k'
+data_preprocessor = dict(
+ num_classes=21842,
+ # RGB format normalization parameters
+ mean=[123.675, 116.28, 103.53],
+ std=[58.395, 57.12, 57.375],
+ # convert image from BGR to RGB
+ to_rgb=True,
+)
+
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='RandomResizedCrop', scale=224),
+ dict(type='RandomFlip', prob=0.5, direction='horizontal'),
+ dict(type='PackInputs'),
+]
+
+train_dataloader = dict(
+ batch_size=128,
+ num_workers=5,
+ dataset=dict(
+ type=dataset_type,
+ data_root='data/imagenet21k',
+ split='train',
+ pipeline=train_pipeline),
+ sampler=dict(type='DefaultSampler', shuffle=True),
+)
diff --git a/configs/_base_/datasets/imagenet_bs128_mbv3.py b/configs/_base_/datasets/imagenet_bs128_mbv3.py
new file mode 100644
index 0000000000000000000000000000000000000000..d355f507bf8e2be5d9efc3cc777e9854196b9d64
--- /dev/null
+++ b/configs/_base_/datasets/imagenet_bs128_mbv3.py
@@ -0,0 +1,66 @@
+# dataset settings
+dataset_type = 'ImageNet'
+data_preprocessor = dict(
+ num_classes=1000,
+ # RGB format normalization parameters
+ mean=[123.675, 116.28, 103.53],
+ std=[58.395, 57.12, 57.375],
+ # convert image from BGR to RGB
+ to_rgb=True,
+)
+
+bgr_mean = data_preprocessor['mean'][::-1]
+bgr_std = data_preprocessor['std'][::-1]
+
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='RandomResizedCrop', scale=224, backend='pillow'),
+ dict(type='RandomFlip', prob=0.5, direction='horizontal'),
+ dict(
+ type='AutoAugment',
+ policies='imagenet',
+ hparams=dict(pad_val=[round(x) for x in bgr_mean])),
+ dict(
+ type='RandomErasing',
+ erase_prob=0.2,
+ mode='rand',
+ min_area_ratio=0.02,
+ max_area_ratio=1 / 3,
+ fill_color=bgr_mean,
+ fill_std=bgr_std),
+ dict(type='PackInputs'),
+]
+
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='ResizeEdge', scale=256, edge='short', backend='pillow'),
+ dict(type='CenterCrop', crop_size=224),
+ dict(type='PackInputs'),
+]
+
+train_dataloader = dict(
+ batch_size=128,
+ num_workers=5,
+ dataset=dict(
+ type=dataset_type,
+ data_root='data/imagenet',
+ split='train',
+ pipeline=train_pipeline),
+ sampler=dict(type='DefaultSampler', shuffle=True),
+)
+
+val_dataloader = dict(
+ batch_size=128,
+ num_workers=5,
+ dataset=dict(
+ type=dataset_type,
+ data_root='data/imagenet',
+ split='val',
+ pipeline=test_pipeline),
+ sampler=dict(type='DefaultSampler', shuffle=False),
+)
+val_evaluator = dict(type='Accuracy', topk=(1, 5))
+
+# If you want standard test, please manually configure the test dataset
+test_dataloader = val_dataloader
+test_evaluator = val_evaluator
diff --git a/configs/_base_/datasets/imagenet_bs128_poolformer_medium_224.py b/configs/_base_/datasets/imagenet_bs128_poolformer_medium_224.py
new file mode 100644
index 0000000000000000000000000000000000000000..be90a655674e22c3341c185c7be5532b1bef8cf1
--- /dev/null
+++ b/configs/_base_/datasets/imagenet_bs128_poolformer_medium_224.py
@@ -0,0 +1,80 @@
+# dataset settings
+dataset_type = 'ImageNet'
+data_preprocessor = dict(
+ num_classes=1000,
+ # RGB format normalization parameters
+ mean=[123.675, 116.28, 103.53],
+ std=[58.395, 57.12, 57.375],
+ # convert image from BGR to RGB
+ to_rgb=True,
+)
+
+bgr_mean = data_preprocessor['mean'][::-1]
+bgr_std = data_preprocessor['std'][::-1]
+
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='RandomResizedCrop',
+ scale=224,
+ backend='pillow',
+ interpolation='bicubic'),
+ dict(type='RandomFlip', prob=0.5, direction='horizontal'),
+ dict(
+ type='RandAugment',
+ policies='timm_increasing',
+ num_policies=2,
+ total_level=10,
+ magnitude_level=9,
+ magnitude_std=0.5,
+ hparams=dict(
+ pad_val=[round(x) for x in bgr_mean], interpolation='bicubic')),
+ dict(
+ type='RandomErasing',
+ erase_prob=0.25,
+ mode='rand',
+ min_area_ratio=0.02,
+ max_area_ratio=1 / 3,
+ fill_color=bgr_mean,
+ fill_std=bgr_std),
+ dict(type='PackInputs'),
+]
+
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='ResizeEdge',
+ scale=236,
+ edge='short',
+ backend='pillow',
+ interpolation='bicubic'),
+ dict(type='CenterCrop', crop_size=224),
+ dict(type='PackInputs'),
+]
+
+train_dataloader = dict(
+ batch_size=128,
+ num_workers=5,
+ dataset=dict(
+ type=dataset_type,
+ data_root='data/imagenet',
+ split='train',
+ pipeline=train_pipeline),
+ sampler=dict(type='DefaultSampler', shuffle=True),
+)
+
+val_dataloader = dict(
+ batch_size=128,
+ num_workers=5,
+ dataset=dict(
+ type=dataset_type,
+ data_root='data/imagenet',
+ split='val',
+ pipeline=test_pipeline),
+ sampler=dict(type='DefaultSampler', shuffle=False),
+)
+val_evaluator = dict(type='Accuracy', topk=(1, 5))
+
+# If you want standard test, please manually configure the test dataset
+test_dataloader = val_dataloader
+test_evaluator = val_evaluator
diff --git a/configs/_base_/datasets/imagenet_bs128_poolformer_small_224.py b/configs/_base_/datasets/imagenet_bs128_poolformer_small_224.py
new file mode 100644
index 0000000000000000000000000000000000000000..c9e0f071ade1feccf6a3f96ef7ad8f28c693e84c
--- /dev/null
+++ b/configs/_base_/datasets/imagenet_bs128_poolformer_small_224.py
@@ -0,0 +1,80 @@
+# dataset settings
+dataset_type = 'ImageNet'
+data_preprocessor = dict(
+ num_classes=1000,
+ # RGB format normalization parameters
+ mean=[123.675, 116.28, 103.53],
+ std=[58.395, 57.12, 57.375],
+ # convert image from BGR to RGB
+ to_rgb=True,
+)
+
+bgr_mean = data_preprocessor['mean'][::-1]
+bgr_std = data_preprocessor['std'][::-1]
+
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='RandomResizedCrop',
+ scale=224,
+ backend='pillow',
+ interpolation='bicubic'),
+ dict(type='RandomFlip', prob=0.5, direction='horizontal'),
+ dict(
+ type='RandAugment',
+ policies='timm_increasing',
+ num_policies=2,
+ total_level=10,
+ magnitude_level=9,
+ magnitude_std=0.5,
+ hparams=dict(
+ pad_val=[round(x) for x in bgr_mean], interpolation='bicubic')),
+ dict(
+ type='RandomErasing',
+ erase_prob=0.25,
+ mode='rand',
+ min_area_ratio=0.02,
+ max_area_ratio=1 / 3,
+ fill_color=bgr_mean,
+ fill_std=bgr_std),
+ dict(type='PackInputs'),
+]
+
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='ResizeEdge',
+ scale=248,
+ edge='short',
+ backend='pillow',
+ interpolation='bicubic'),
+ dict(type='CenterCrop', crop_size=224),
+ dict(type='PackInputs'),
+]
+
+train_dataloader = dict(
+ batch_size=128,
+ num_workers=5,
+ dataset=dict(
+ type=dataset_type,
+ data_root='data/imagenet',
+ split='train',
+ pipeline=train_pipeline),
+ sampler=dict(type='DefaultSampler', shuffle=True),
+)
+
+val_dataloader = dict(
+ batch_size=128,
+ num_workers=5,
+ dataset=dict(
+ type=dataset_type,
+ data_root='data/imagenet',
+ split='val',
+ pipeline=test_pipeline),
+ sampler=dict(type='DefaultSampler', shuffle=False),
+)
+val_evaluator = dict(type='Accuracy', topk=(1, 5))
+
+# If you want standard test, please manually configure the test dataset
+test_dataloader = val_dataloader
+test_evaluator = val_evaluator
diff --git a/configs/_base_/datasets/imagenet_bs128_revvit_224.py b/configs/_base_/datasets/imagenet_bs128_revvit_224.py
new file mode 100644
index 0000000000000000000000000000000000000000..fd87aaf033b08dd94b5a684eed759072ff6fd4e9
--- /dev/null
+++ b/configs/_base_/datasets/imagenet_bs128_revvit_224.py
@@ -0,0 +1,83 @@
+# dataset settings
+dataset_type = 'ImageNet'
+data_preprocessor = dict(
+ num_classes=1000,
+ # RGB format normalization parameters
+ mean=[123.675, 116.28, 103.53],
+ std=[58.395, 57.12, 57.375],
+ # convert image from BGR to RGB
+ to_rgb=True,
+)
+
+bgr_mean = data_preprocessor['mean'][::-1]
+bgr_std = data_preprocessor['std'][::-1]
+
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='RandomResizedCrop',
+ scale=224,
+ backend='pillow',
+ interpolation='bicubic'),
+ dict(type='RandomFlip', prob=0.5, direction='horizontal'),
+ dict(
+ type='RandAugment',
+ policies='timm_increasing',
+ num_policies=2,
+ total_level=10,
+ magnitude_level=7,
+ magnitude_std=0.5,
+ hparams=dict(
+ pad_val=[round(x) for x in bgr_mean], interpolation='bicubic')),
+ dict(type='ColorJitter', brightness=0.4, contrast=0.4, saturation=0.4),
+ dict(
+ type='RandomErasing',
+ erase_prob=0.25,
+ mode='rand', # should be 'pixel', but currently not supported
+ min_area_ratio=0.02,
+ max_area_ratio=1 / 3,
+ fill_color=bgr_mean,
+ fill_std=bgr_std),
+ dict(type='PackInputs'),
+]
+
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='ResizeEdge',
+ scale=256,
+ edge='short',
+ backend='pillow',
+ interpolation='bicubic'),
+ dict(type='CenterCrop', crop_size=224),
+ dict(type='PackInputs'),
+]
+
+train_dataloader = dict(
+ batch_size=256,
+ num_workers=5,
+ dataset=dict(
+ type=dataset_type,
+ data_root='data/imagenet',
+ split='train',
+ pipeline=train_pipeline),
+ sampler=dict(type='DefaultSampler', shuffle=True),
+ persistent_workers=True,
+)
+
+val_dataloader = dict(
+ batch_size=64,
+ num_workers=5,
+ dataset=dict(
+ type=dataset_type,
+ data_root='data/imagenet',
+ split='val',
+ pipeline=test_pipeline),
+ sampler=dict(type='DefaultSampler', shuffle=False),
+ persistent_workers=True,
+)
+val_evaluator = dict(type='Accuracy', topk=(1, 5))
+
+# If you want standard test, please manually configure the test dataset
+test_dataloader = val_dataloader
+test_evaluator = val_evaluator
diff --git a/configs/_base_/datasets/imagenet_bs128_riformer_medium_384.py b/configs/_base_/datasets/imagenet_bs128_riformer_medium_384.py
new file mode 100644
index 0000000000000000000000000000000000000000..151ded7895b378ba7e6bf5895fb11d903841b95d
--- /dev/null
+++ b/configs/_base_/datasets/imagenet_bs128_riformer_medium_384.py
@@ -0,0 +1,80 @@
+# dataset settings
+dataset_type = 'ImageNet'
+data_preprocessor = dict(
+ num_classes=1000,
+ # RGB format normalization parameters
+ mean=[123.675, 116.28, 103.53],
+ std=[58.395, 57.12, 57.375],
+ # convert image from BGR to RGB
+ to_rgb=True,
+)
+
+bgr_mean = data_preprocessor['mean'][::-1]
+bgr_std = data_preprocessor['std'][::-1]
+
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='RandomResizedCrop',
+ scale=384,
+ backend='pillow',
+ interpolation='bicubic'),
+ dict(type='RandomFlip', prob=0.5, direction='horizontal'),
+ dict(
+ type='RandAugment',
+ policies='timm_increasing',
+ num_policies=2,
+ total_level=10,
+ magnitude_level=9,
+ magnitude_std=0.5,
+ hparams=dict(
+ pad_val=[round(x) for x in bgr_mean], interpolation='bicubic')),
+ dict(
+ type='RandomErasing',
+ erase_prob=0.25,
+ mode='rand',
+ min_area_ratio=0.02,
+ max_area_ratio=1 / 3,
+ fill_color=bgr_mean,
+ fill_std=bgr_std),
+ dict(type='PackInputs'),
+]
+
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='ResizeEdge',
+ scale=404,
+ edge='short',
+ backend='pillow',
+ interpolation='bicubic'),
+ dict(type='CenterCrop', crop_size=384),
+ dict(type='PackInputs'),
+]
+
+train_dataloader = dict(
+ batch_size=128,
+ num_workers=5,
+ dataset=dict(
+ type=dataset_type,
+ data_root='data/imagenet',
+ split='train',
+ pipeline=train_pipeline),
+ sampler=dict(type='DefaultSampler', shuffle=True),
+)
+
+val_dataloader = dict(
+ batch_size=16,
+ num_workers=5,
+ dataset=dict(
+ type=dataset_type,
+ data_root='data/imagenet',
+ split='val',
+ pipeline=test_pipeline),
+ sampler=dict(type='DefaultSampler', shuffle=False),
+)
+val_evaluator = dict(type='Accuracy', topk=(1, 5))
+
+# If you want standard test, please manually configure the test dataset
+test_dataloader = val_dataloader
+test_evaluator = val_evaluator
diff --git a/configs/_base_/datasets/imagenet_bs128_riformer_small_384.py b/configs/_base_/datasets/imagenet_bs128_riformer_small_384.py
new file mode 100644
index 0000000000000000000000000000000000000000..ea9799ba9c41fcbaf049a54d9776750c860a598c
--- /dev/null
+++ b/configs/_base_/datasets/imagenet_bs128_riformer_small_384.py
@@ -0,0 +1,80 @@
+# dataset settings
+dataset_type = 'ImageNet'
+data_preprocessor = dict(
+ num_classes=1000,
+ # RGB format normalization parameters
+ mean=[123.675, 116.28, 103.53],
+ std=[58.395, 57.12, 57.375],
+ # convert image from BGR to RGB
+ to_rgb=True,
+)
+
+bgr_mean = data_preprocessor['mean'][::-1]
+bgr_std = data_preprocessor['std'][::-1]
+
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='RandomResizedCrop',
+ scale=384,
+ backend='pillow',
+ interpolation='bicubic'),
+ dict(type='RandomFlip', prob=0.5, direction='horizontal'),
+ dict(
+ type='RandAugment',
+ policies='timm_increasing',
+ num_policies=2,
+ total_level=10,
+ magnitude_level=9,
+ magnitude_std=0.5,
+ hparams=dict(
+ pad_val=[round(x) for x in bgr_mean], interpolation='bicubic')),
+ dict(
+ type='RandomErasing',
+ erase_prob=0.25,
+ mode='rand',
+ min_area_ratio=0.02,
+ max_area_ratio=1 / 3,
+ fill_color=bgr_mean,
+ fill_std=bgr_std),
+ dict(type='PackInputs'),
+]
+
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='ResizeEdge',
+ scale=426,
+ edge='short',
+ backend='pillow',
+ interpolation='bicubic'),
+ dict(type='CenterCrop', crop_size=384),
+ dict(type='PackInputs'),
+]
+
+train_dataloader = dict(
+ batch_size=128,
+ num_workers=5,
+ dataset=dict(
+ type=dataset_type,
+ data_root='data/imagenet',
+ split='train',
+ pipeline=train_pipeline),
+ sampler=dict(type='DefaultSampler', shuffle=True),
+)
+
+val_dataloader = dict(
+ batch_size=32,
+ num_workers=5,
+ dataset=dict(
+ type=dataset_type,
+ data_root='data/imagenet',
+ split='val',
+ pipeline=test_pipeline),
+ sampler=dict(type='DefaultSampler', shuffle=False),
+)
+val_evaluator = dict(type='Accuracy', topk=(1, 5))
+
+# If you want standard test, please manually configure the test dataset
+test_dataloader = val_dataloader
+test_evaluator = val_evaluator
diff --git a/configs/_base_/datasets/imagenet_bs128_vig_224.py b/configs/_base_/datasets/imagenet_bs128_vig_224.py
new file mode 100644
index 0000000000000000000000000000000000000000..abb0182a6ce53202bee905bcd3849b851852b4b4
--- /dev/null
+++ b/configs/_base_/datasets/imagenet_bs128_vig_224.py
@@ -0,0 +1,80 @@
+# dataset settings
+dataset_type = 'ImageNet'
+data_preprocessor = dict(
+ num_classes=1000,
+ # RGB format normalization parameters
+ mean=[127.5, 127.5, 127.5],
+ std=[127.5, 127.5, 127.5],
+ # convert image from BGR to RGB
+ to_rgb=True,
+)
+
+bgr_mean = data_preprocessor['mean'][::-1]
+bgr_std = data_preprocessor['std'][::-1]
+
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='RandomResizedCrop',
+ scale=224,
+ backend='pillow',
+ interpolation='bicubic'),
+ dict(type='RandomFlip', prob=0.5, direction='horizontal'),
+ dict(
+ type='RandAugment',
+ policies='timm_increasing',
+ num_policies=2,
+ total_level=10,
+ magnitude_level=9,
+ magnitude_std=0.5,
+ hparams=dict(
+ pad_val=[round(x) for x in bgr_mean], interpolation='bicubic')),
+ dict(
+ type='RandomErasing',
+ erase_prob=0.25,
+ mode='rand',
+ min_area_ratio=0.02,
+ max_area_ratio=1 / 3,
+ fill_color=bgr_mean,
+ fill_std=bgr_std),
+ dict(type='PackInputs'),
+]
+
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='ResizeEdge',
+ scale=248,
+ edge='short',
+ backend='pillow',
+ interpolation='bicubic'),
+ dict(type='CenterCrop', crop_size=224),
+ dict(type='PackInputs'),
+]
+
+train_dataloader = dict(
+ batch_size=128,
+ num_workers=5,
+ dataset=dict(
+ type=dataset_type,
+ data_root='data/imagenet',
+ split='train',
+ pipeline=train_pipeline),
+ sampler=dict(type='DefaultSampler', shuffle=True),
+)
+
+val_dataloader = dict(
+ batch_size=128,
+ num_workers=5,
+ dataset=dict(
+ type=dataset_type,
+ data_root='data/imagenet',
+ split='val',
+ pipeline=test_pipeline),
+ sampler=dict(type='DefaultSampler', shuffle=False),
+)
+val_evaluator = dict(type='Accuracy', topk=(1, 5))
+
+# If you want standard test, please manually configure the test dataset
+test_dataloader = val_dataloader
+test_evaluator = val_evaluator
diff --git a/configs/_base_/datasets/imagenet_bs16_eva_196.py b/configs/_base_/datasets/imagenet_bs16_eva_196.py
new file mode 100644
index 0000000000000000000000000000000000000000..f668e1d6e56ab4c5e311af912fe4b560a3a12bfd
--- /dev/null
+++ b/configs/_base_/datasets/imagenet_bs16_eva_196.py
@@ -0,0 +1,60 @@
+# dataset settings
+dataset_type = 'ImageNet'
+data_preprocessor = dict(
+ num_classes=1000,
+ # RGB format normalization parameters
+ mean=[0.48145466 * 255, 0.4578275 * 255, 0.40821073 * 255],
+ std=[0.26862954 * 255, 0.26130258 * 255, 0.27577711 * 255],
+ # convert image from BGR to RGB
+ to_rgb=True,
+)
+
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='RandomResizedCrop',
+ scale=196,
+ backend='pillow',
+ interpolation='bicubic'),
+ dict(type='RandomFlip', prob=0.5, direction='horizontal'),
+ dict(type='PackInputs'),
+]
+
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='ResizeEdge',
+ scale=196,
+ edge='short',
+ backend='pillow',
+ interpolation='bicubic'),
+ dict(type='CenterCrop', crop_size=196),
+ dict(type='PackInputs'),
+]
+
+train_dataloader = dict(
+ batch_size=16,
+ num_workers=5,
+ dataset=dict(
+ type=dataset_type,
+ data_root='data/imagenet',
+ split='train',
+ pipeline=train_pipeline),
+ sampler=dict(type='DefaultSampler', shuffle=True),
+)
+
+val_dataloader = dict(
+ batch_size=16,
+ num_workers=5,
+ dataset=dict(
+ type=dataset_type,
+ data_root='data/imagenet',
+ split='val',
+ pipeline=test_pipeline),
+ sampler=dict(type='DefaultSampler', shuffle=False),
+)
+val_evaluator = dict(type='Accuracy', topk=(1, 5))
+
+# If you want standard test, please manually configure the test dataset
+test_dataloader = val_dataloader
+test_evaluator = val_evaluator
diff --git a/configs/_base_/datasets/imagenet_bs16_eva_336.py b/configs/_base_/datasets/imagenet_bs16_eva_336.py
new file mode 100644
index 0000000000000000000000000000000000000000..e2c770af0f58a4db5d0435807f3cc9b499d01295
--- /dev/null
+++ b/configs/_base_/datasets/imagenet_bs16_eva_336.py
@@ -0,0 +1,60 @@
+# dataset settings
+dataset_type = 'ImageNet'
+data_preprocessor = dict(
+ num_classes=1000,
+ # RGB format normalization parameters
+ mean=[0.48145466 * 255, 0.4578275 * 255, 0.40821073 * 255],
+ std=[0.26862954 * 255, 0.26130258 * 255, 0.27577711 * 255],
+ # convert image from BGR to RGB
+ to_rgb=True,
+)
+
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='RandomResizedCrop',
+ scale=336,
+ backend='pillow',
+ interpolation='bicubic'),
+ dict(type='RandomFlip', prob=0.5, direction='horizontal'),
+ dict(type='PackInputs'),
+]
+
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='ResizeEdge',
+ scale=336,
+ edge='short',
+ backend='pillow',
+ interpolation='bicubic'),
+ dict(type='CenterCrop', crop_size=336),
+ dict(type='PackInputs'),
+]
+
+train_dataloader = dict(
+ batch_size=16,
+ num_workers=5,
+ dataset=dict(
+ type=dataset_type,
+ data_root='data/imagenet',
+ split='train',
+ pipeline=train_pipeline),
+ sampler=dict(type='DefaultSampler', shuffle=True),
+)
+
+val_dataloader = dict(
+ batch_size=16,
+ num_workers=5,
+ dataset=dict(
+ type=dataset_type,
+ data_root='data/imagenet',
+ split='val',
+ pipeline=test_pipeline),
+ sampler=dict(type='DefaultSampler', shuffle=False),
+)
+val_evaluator = dict(type='Accuracy', topk=(1, 5))
+
+# If you want standard test, please manually configure the test dataset
+test_dataloader = val_dataloader
+test_evaluator = val_evaluator
diff --git a/configs/_base_/datasets/imagenet_bs16_eva_448.py b/configs/_base_/datasets/imagenet_bs16_eva_448.py
new file mode 100644
index 0000000000000000000000000000000000000000..b90bba14eefb3c7e0bac8234dd84461a7b420462
--- /dev/null
+++ b/configs/_base_/datasets/imagenet_bs16_eva_448.py
@@ -0,0 +1,62 @@
+# dataset settings
+dataset_type = 'ImageNet'
+data_preprocessor = dict(
+ num_classes=1000,
+ # RGB format normalization parameters
+ mean=[0.48145466 * 255, 0.4578275 * 255, 0.40821073 * 255],
+ std=[0.26862954 * 255, 0.26130258 * 255, 0.27577711 * 255],
+ # convert image from BGR to RGB
+ to_rgb=True,
+)
+
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='RandomResizedCrop',
+ scale=448,
+ backend='pillow',
+ interpolation='bicubic'),
+ dict(type='RandomFlip', prob=0.5, direction='horizontal'),
+ dict(type='PackInputs'),
+]
+
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='ResizeEdge',
+ scale=448,
+ edge='short',
+ backend='pillow',
+ interpolation='bicubic'),
+ dict(type='CenterCrop', crop_size=448),
+ dict(type='PackInputs'),
+]
+
+train_dataloader = dict(
+ batch_size=16,
+ num_workers=5,
+ dataset=dict(
+ type=dataset_type,
+ data_root='data/imagenet',
+ ann_file='meta/train.txt',
+ data_prefix='train',
+ pipeline=train_pipeline),
+ sampler=dict(type='DefaultSampler', shuffle=True),
+)
+
+val_dataloader = dict(
+ batch_size=8,
+ num_workers=5,
+ dataset=dict(
+ type=dataset_type,
+ data_root='data/imagenet',
+ ann_file='meta/val.txt',
+ data_prefix='val',
+ pipeline=test_pipeline),
+ sampler=dict(type='DefaultSampler', shuffle=False),
+)
+val_evaluator = dict(type='Accuracy', topk=(1, 5))
+
+# If you want standard test, please manually configure the test dataset
+test_dataloader = val_dataloader
+test_evaluator = val_evaluator
diff --git a/configs/_base_/datasets/imagenet_bs16_eva_560.py b/configs/_base_/datasets/imagenet_bs16_eva_560.py
new file mode 100644
index 0000000000000000000000000000000000000000..9e548cc2a8de33fcd8ec80a2652dabcb931519aa
--- /dev/null
+++ b/configs/_base_/datasets/imagenet_bs16_eva_560.py
@@ -0,0 +1,60 @@
+# dataset settings
+dataset_type = 'ImageNet'
+data_preprocessor = dict(
+ num_classes=1000,
+ # RGB format normalization parameters
+ mean=[0.48145466 * 255, 0.4578275 * 255, 0.40821073 * 255],
+ std=[0.26862954 * 255, 0.26130258 * 255, 0.27577711 * 255],
+ # convert image from BGR to RGB
+ to_rgb=True,
+)
+
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='RandomResizedCrop',
+ scale=560,
+ backend='pillow',
+ interpolation='bicubic'),
+ dict(type='RandomFlip', prob=0.5, direction='horizontal'),
+ dict(type='PackInputs'),
+]
+
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='ResizeEdge',
+ scale=560,
+ edge='short',
+ backend='pillow',
+ interpolation='bicubic'),
+ dict(type='CenterCrop', crop_size=560),
+ dict(type='PackInputs'),
+]
+
+train_dataloader = dict(
+ batch_size=16,
+ num_workers=5,
+ dataset=dict(
+ type=dataset_type,
+ data_root='data/imagenet',
+ split='train',
+ pipeline=train_pipeline),
+ sampler=dict(type='DefaultSampler', shuffle=True),
+)
+
+val_dataloader = dict(
+ batch_size=16,
+ num_workers=5,
+ dataset=dict(
+ type=dataset_type,
+ data_root='data/imagenet',
+ split='val',
+ pipeline=test_pipeline),
+ sampler=dict(type='DefaultSampler', shuffle=False),
+)
+val_evaluator = dict(type='Accuracy', topk=(1, 5))
+
+# If you want standard test, please manually configure the test dataset
+test_dataloader = val_dataloader
+test_evaluator = val_evaluator
diff --git a/configs/_base_/datasets/imagenet_bs16_pil_bicubic_384.py b/configs/_base_/datasets/imagenet_bs16_pil_bicubic_384.py
new file mode 100644
index 0000000000000000000000000000000000000000..8507af4dd0219d8aa6449b6b3d9a1f8d39f1bfce
--- /dev/null
+++ b/configs/_base_/datasets/imagenet_bs16_pil_bicubic_384.py
@@ -0,0 +1,53 @@
+# dataset settings
+dataset_type = 'ImageNet'
+data_preprocessor = dict(
+ # RGB format normalization parameters
+ mean=[123.675, 116.28, 103.53],
+ std=[58.395, 57.12, 57.375],
+ # convert image from BGR to RGB
+ to_rgb=True,
+)
+
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='RandomResizedCrop',
+ scale=384,
+ backend='pillow',
+ interpolation='bicubic'),
+ dict(type='RandomFlip', prob=0.5, direction='horizontal'),
+ dict(type='PackInputs'),
+]
+
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='Resize', scale=384, backend='pillow', interpolation='bicubic'),
+ dict(type='PackInputs'),
+]
+
+train_dataloader = dict(
+ batch_size=16,
+ num_workers=5,
+ dataset=dict(
+ type=dataset_type,
+ data_root='data/imagenet',
+ split='train',
+ pipeline=train_pipeline),
+ sampler=dict(type='DefaultSampler', shuffle=True),
+)
+
+val_dataloader = dict(
+ batch_size=16,
+ num_workers=5,
+ dataset=dict(
+ type=dataset_type,
+ data_root='data/imagenet',
+ split='val',
+ pipeline=test_pipeline),
+ sampler=dict(type='DefaultSampler', shuffle=False),
+)
+val_evaluator = dict(type='Accuracy', topk=(1, 5))
+
+# If you want standard test, please manually configure the test dataset
+test_dataloader = val_dataloader
+test_evaluator = val_evaluator
diff --git a/configs/_base_/datasets/imagenet_bs256_beitv2.py b/configs/_base_/datasets/imagenet_bs256_beitv2.py
new file mode 100644
index 0000000000000000000000000000000000000000..9d420326f2cf3e26f1478d684a03e39c51799534
--- /dev/null
+++ b/configs/_base_/datasets/imagenet_bs256_beitv2.py
@@ -0,0 +1,47 @@
+# dataset settings
+dataset_type = 'ImageNet'
+data_root = 'data/imagenet/'
+data_preprocessor = dict(
+ type='TwoNormDataPreprocessor',
+ mean=[123.675, 116.28, 103.53],
+ std=[58.395, 57.12, 57.375],
+ second_mean=[127.5, 127.5, 127.5],
+ second_std=[127.5, 127.5, 127.5],
+ to_rgb=True)
+
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='ColorJitter',
+ brightness=0.4,
+ contrast=0.4,
+ saturation=0.4,
+ hue=0.),
+ dict(type='RandomFlip', prob=0.5, direction='horizontal'),
+ dict(
+ type='RandomResizedCropAndInterpolationWithTwoPic',
+ size=224,
+ second_size=224,
+ interpolation='bicubic',
+ second_interpolation='bicubic',
+ scale=(0.2, 1.0)),
+ dict(
+ type='BEiTMaskGenerator',
+ input_size=(14, 14),
+ num_masking_patches=75,
+ max_num_patches=75,
+ min_num_patches=16),
+ dict(type='PackInputs')
+]
+
+train_dataloader = dict(
+ batch_size=256,
+ num_workers=8,
+ persistent_workers=True,
+ sampler=dict(type='DefaultSampler', shuffle=True),
+ collate_fn=dict(type='default_collate'),
+ dataset=dict(
+ type=dataset_type,
+ data_root=data_root,
+ split='train',
+ pipeline=train_pipeline))
diff --git a/configs/_base_/datasets/imagenet_bs256_davit_224.py b/configs/_base_/datasets/imagenet_bs256_davit_224.py
new file mode 100644
index 0000000000000000000000000000000000000000..3ea0a8382d8feaae6f39808b6b1193684294f918
--- /dev/null
+++ b/configs/_base_/datasets/imagenet_bs256_davit_224.py
@@ -0,0 +1,80 @@
+# dataset settings
+dataset_type = 'ImageNet'
+data_preprocessor = dict(
+ num_classes=1000,
+ # RGB format normalization parameters
+ mean=[123.675, 116.28, 103.53],
+ std=[58.395, 57.12, 57.375],
+ # convert image from BGR to RGB
+ to_rgb=True,
+)
+
+bgr_mean = data_preprocessor['mean'][::-1]
+bgr_std = data_preprocessor['std'][::-1]
+
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='RandomResizedCrop',
+ scale=224,
+ backend='pillow',
+ interpolation='bicubic'),
+ dict(type='RandomFlip', prob=0.5, direction='horizontal'),
+ dict(
+ type='RandAugment',
+ policies='timm_increasing',
+ num_policies=2,
+ total_level=10,
+ magnitude_level=9,
+ magnitude_std=0.5,
+ hparams=dict(
+ pad_val=[round(x) for x in bgr_mean], interpolation='bicubic')),
+ dict(
+ type='RandomErasing',
+ erase_prob=0.25,
+ mode='rand',
+ min_area_ratio=0.02,
+ max_area_ratio=1 / 3,
+ fill_color=bgr_mean,
+ fill_std=bgr_std),
+ dict(type='PackInputs'),
+]
+
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='ResizeEdge',
+ scale=236,
+ edge='short',
+ backend='pillow',
+ interpolation='bicubic'),
+ dict(type='CenterCrop', crop_size=224),
+ dict(type='PackInputs'),
+]
+
+train_dataloader = dict(
+ batch_size=64,
+ num_workers=5,
+ dataset=dict(
+ type=dataset_type,
+ data_root='data/imagenet',
+ split='train',
+ pipeline=train_pipeline),
+ sampler=dict(type='DefaultSampler', shuffle=True),
+)
+
+val_dataloader = dict(
+ batch_size=64,
+ num_workers=5,
+ dataset=dict(
+ type=dataset_type,
+ data_root='data/imagenet',
+ split='val',
+ pipeline=test_pipeline),
+ sampler=dict(type='DefaultSampler', shuffle=False),
+)
+val_evaluator = dict(type='Accuracy', topk=(1, 5))
+
+# If you want standard test, please manually configure the test dataset
+test_dataloader = val_dataloader
+test_evaluator = val_evaluator
diff --git a/configs/_base_/datasets/imagenet_bs256_itpn.py b/configs/_base_/datasets/imagenet_bs256_itpn.py
new file mode 100644
index 0000000000000000000000000000000000000000..0b51c47272a99c4257a8c98dfe0b2bb8652e54a4
--- /dev/null
+++ b/configs/_base_/datasets/imagenet_bs256_itpn.py
@@ -0,0 +1,49 @@
+# dataset settings
+dataset_type = 'ImageNet'
+data_root = 'data/imagenet/'
+data_preprocessor = dict(
+ type='TwoNormDataPreprocessor',
+ mean=[123.675, 116.28, 103.53],
+ std=[58.395, 57.12, 57.375],
+ # clip mean & std
+ second_mean=[0.48145466 * 255, 0.4578275 * 255, 0.40821073 * 255],
+ second_std=[0.26862954 * 255, 0.26130258 * 255, 0.27577711 * 255],
+ to_rgb=True)
+
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='ColorJitter',
+ brightness=0.4,
+ contrast=0.4,
+ saturation=0.4,
+ hue=0.),
+ dict(type='RandomFlip', prob=0.5, direction='horizontal'),
+ dict(
+ type='RandomResizedCropAndInterpolationWithTwoPic',
+ size=224,
+ second_size=224,
+ interpolation='bicubic',
+ second_interpolation='bicubic',
+ scale=(0.2, 1.0)),
+ dict(
+ type='BEiTMaskGenerator',
+ input_size=(14, 14),
+ num_masking_patches=75,
+ max_num_patches=75,
+ min_num_patches=16),
+ dict(type='PackInputs')
+]
+
+train_dataloader = dict(
+ batch_size=256,
+ num_workers=8,
+ persistent_workers=True,
+ sampler=dict(type='DefaultSampler', shuffle=True),
+ collate_fn=dict(type='default_collate'),
+ dataset=dict(
+ type=dataset_type,
+ data_root=data_root,
+ ann_file='meta/train.txt',
+ data_prefix=dict(img_path='train/'),
+ pipeline=train_pipeline))
diff --git a/configs/_base_/datasets/imagenet_bs256_levit_224.py b/configs/_base_/datasets/imagenet_bs256_levit_224.py
new file mode 100644
index 0000000000000000000000000000000000000000..612db7d7f0777ba50c78c084be8db7ba57266942
--- /dev/null
+++ b/configs/_base_/datasets/imagenet_bs256_levit_224.py
@@ -0,0 +1,80 @@
+dataset_type = 'ImageNet'
+
+data_preprocessor = dict(
+ num_classes=1000,
+ # RGB format normalization parameters
+ mean=[123.675, 116.28, 103.53],
+ std=[58.395, 57.12, 57.375],
+ # convert image from BGR to RGB
+ to_rgb=True,
+)
+
+bgr_mean = data_preprocessor['mean'][::-1]
+bgr_std = data_preprocessor['std'][::-1]
+
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='RandomResizedCrop',
+ scale=224,
+ backend='pillow',
+ interpolation='bicubic'),
+ dict(type='RandomFlip', prob=0.5, direction='horizontal'),
+ dict(
+ type='RandAugment',
+ policies='timm_increasing',
+ num_policies=2,
+ total_level=10,
+ magnitude_level=9,
+ magnitude_std=0.5,
+ hparams=dict(
+ pad_val=[round(x) for x in bgr_mean], interpolation='bicubic')),
+ dict(
+ type='RandomErasing',
+ erase_prob=0.25,
+ mode='rand',
+ min_area_ratio=0.02,
+ max_area_ratio=1 / 3,
+ fill_color=bgr_mean,
+ fill_std=bgr_std),
+ dict(type='PackInputs'),
+]
+
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='ResizeEdge',
+ scale=256,
+ edge='short',
+ backend='pillow',
+ interpolation='bicubic'),
+ dict(type='CenterCrop', crop_size=224),
+ dict(type='PackInputs'),
+]
+
+train_dataloader = dict(
+ batch_size=256,
+ num_workers=4,
+ dataset=dict(
+ type=dataset_type,
+ data_root='data/imagenet',
+ split='train',
+ pipeline=train_pipeline),
+ sampler=dict(type='DefaultSampler', shuffle=True),
+)
+
+val_dataloader = dict(
+ batch_size=256,
+ num_workers=4,
+ dataset=dict(
+ type=dataset_type,
+ data_root='data/imagenet',
+ split='val',
+ pipeline=test_pipeline),
+ sampler=dict(type='DefaultSampler', shuffle=False),
+)
+val_evaluator = dict(type='Accuracy', topk=(1, 5))
+
+# If you want standard test, please manually configure the test dataset
+test_dataloader = val_dataloader
+test_evaluator = val_evaluator
diff --git a/configs/_base_/datasets/imagenet_bs256_rsb_a12.py b/configs/_base_/datasets/imagenet_bs256_rsb_a12.py
new file mode 100644
index 0000000000000000000000000000000000000000..ab59d9e42fea20b316f306023c86c7b75acdb80f
--- /dev/null
+++ b/configs/_base_/datasets/imagenet_bs256_rsb_a12.py
@@ -0,0 +1,72 @@
+# dataset settings
+dataset_type = 'ImageNet'
+data_preprocessor = dict(
+ num_classes=1000,
+ # RGB format normalization parameters
+ mean=[123.675, 116.28, 103.53],
+ std=[58.395, 57.12, 57.375],
+ # convert image from BGR to RGB
+ to_rgb=True,
+)
+
+bgr_mean = data_preprocessor['mean'][::-1]
+bgr_std = data_preprocessor['std'][::-1]
+
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='RandomResizedCrop',
+ scale=224,
+ backend='pillow',
+ interpolation='bicubic'),
+ dict(type='RandomFlip', prob=0.5, direction='horizontal'),
+ dict(
+ type='RandAugment',
+ policies='timm_increasing',
+ num_policies=2,
+ total_level=10,
+ magnitude_level=7,
+ magnitude_std=0.5,
+ hparams=dict(
+ pad_val=[round(x) for x in bgr_mean], interpolation='bicubic')),
+ dict(type='PackInputs'),
+]
+
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='ResizeEdge',
+ scale=236,
+ edge='short',
+ backend='pillow',
+ interpolation='bicubic'),
+ dict(type='CenterCrop', crop_size=224),
+ dict(type='PackInputs')
+]
+
+train_dataloader = dict(
+ batch_size=256,
+ num_workers=5,
+ dataset=dict(
+ type=dataset_type,
+ data_root='data/imagenet',
+ split='train',
+ pipeline=train_pipeline),
+ sampler=dict(type='DefaultSampler', shuffle=True),
+)
+
+val_dataloader = dict(
+ batch_size=256,
+ num_workers=5,
+ dataset=dict(
+ type=dataset_type,
+ data_root='data/imagenet',
+ split='val',
+ pipeline=test_pipeline),
+ sampler=dict(type='DefaultSampler', shuffle=False),
+)
+val_evaluator = dict(type='Accuracy', topk=(1, 5))
+
+# If you want standard test, please manually configure the test dataset
+test_dataloader = val_dataloader
+test_evaluator = val_evaluator
diff --git a/configs/_base_/datasets/imagenet_bs256_rsb_a3.py b/configs/_base_/datasets/imagenet_bs256_rsb_a3.py
new file mode 100644
index 0000000000000000000000000000000000000000..02e34497d8ba68416cab4b08b8347a9781899a4f
--- /dev/null
+++ b/configs/_base_/datasets/imagenet_bs256_rsb_a3.py
@@ -0,0 +1,72 @@
+# dataset settings
+dataset_type = 'ImageNet'
+data_preprocessor = dict(
+ num_classes=1000,
+ # RGB format normalization parameters
+ mean=[123.675, 116.28, 103.53],
+ std=[58.395, 57.12, 57.375],
+ # convert image from BGR to RGB
+ to_rgb=True,
+)
+
+bgr_mean = data_preprocessor['mean'][::-1]
+bgr_std = data_preprocessor['std'][::-1]
+
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='RandomResizedCrop',
+ scale=224,
+ backend='pillow',
+ interpolation='bicubic'),
+ dict(type='RandomFlip', prob=0.5, direction='horizontal'),
+ dict(
+ type='RandAugment',
+ policies='timm_increasing',
+ num_policies=2,
+ total_level=10,
+ magnitude_level=6,
+ magnitude_std=0.5,
+ hparams=dict(
+ pad_val=[round(x) for x in bgr_mean], interpolation='bicubic')),
+ dict(type='PackInputs'),
+]
+
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='ResizeEdge',
+ scale=236,
+ edge='short',
+ backend='pillow',
+ interpolation='bicubic'),
+ dict(type='CenterCrop', crop_size=224),
+ dict(type='PackInputs')
+]
+
+train_dataloader = dict(
+ batch_size=256,
+ num_workers=5,
+ dataset=dict(
+ type=dataset_type,
+ data_root='data/imagenet',
+ split='train',
+ pipeline=train_pipeline),
+ sampler=dict(type='DefaultSampler', shuffle=True),
+)
+
+val_dataloader = dict(
+ batch_size=256,
+ num_workers=5,
+ dataset=dict(
+ type=dataset_type,
+ data_root='data/imagenet',
+ split='val',
+ pipeline=test_pipeline),
+ sampler=dict(type='DefaultSampler', shuffle=False),
+)
+val_evaluator = dict(type='Accuracy', topk=(1, 5))
+
+# If you want standard test, please manually configure the test dataset
+test_dataloader = val_dataloader
+test_evaluator = val_evaluator
diff --git a/configs/_base_/datasets/imagenet_bs256_simmim_192.py b/configs/_base_/datasets/imagenet_bs256_simmim_192.py
new file mode 100644
index 0000000000000000000000000000000000000000..45062e9c28bac95737e4783c80f353870343b6f2
--- /dev/null
+++ b/configs/_base_/datasets/imagenet_bs256_simmim_192.py
@@ -0,0 +1,33 @@
+# dataset settings
+dataset_type = 'ImageNet'
+data_root = 'data/imagenet/'
+data_preprocessor = dict(
+ type='SelfSupDataPreprocessor',
+ mean=[123.675, 116.28, 103.53],
+ std=[58.395, 57.12, 57.375],
+ to_rgb=True)
+
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='RandomResizedCrop', scale=192, crop_ratio_range=(0.67, 1.0)),
+ dict(type='RandomFlip', prob=0.5),
+ dict(
+ type='SimMIMMaskGenerator',
+ input_size=192,
+ mask_patch_size=32,
+ model_patch_size=4,
+ mask_ratio=0.6),
+ dict(type='PackInputs')
+]
+
+train_dataloader = dict(
+ batch_size=256,
+ num_workers=8,
+ persistent_workers=True,
+ sampler=dict(type='DefaultSampler', shuffle=True),
+ collate_fn=dict(type='default_collate'),
+ dataset=dict(
+ type=dataset_type,
+ data_root=data_root,
+ split='train',
+ pipeline=train_pipeline))
diff --git a/configs/_base_/datasets/imagenet_bs256_swin_192.py b/configs/_base_/datasets/imagenet_bs256_swin_192.py
new file mode 100644
index 0000000000000000000000000000000000000000..11c2cb2a82ec320f18b21c89e2bd455a51912c24
--- /dev/null
+++ b/configs/_base_/datasets/imagenet_bs256_swin_192.py
@@ -0,0 +1,81 @@
+# dataset settings
+dataset_type = 'ImageNet'
+data_root = 'data/imagenet/'
+data_preprocessor = dict(
+ num_classes=1000,
+ # RGB format normalization parameters
+ mean=[123.675, 116.28, 103.53],
+ std=[58.395, 57.12, 57.375],
+ # convert image from BGR to RGB
+ to_rgb=True,
+)
+
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='RandomResizedCrop',
+ scale=192,
+ backend='pillow',
+ interpolation='bicubic'),
+ dict(type='RandomFlip', prob=0.5, direction='horizontal'),
+ dict(
+ type='RandAugment',
+ policies='timm_increasing',
+ num_policies=2,
+ total_level=10,
+ magnitude_level=9,
+ magnitude_std=0.5,
+ hparams=dict(pad_val=[104, 116, 124], interpolation='bicubic')),
+ dict(
+ type='RandomErasing',
+ erase_prob=0.25,
+ mode='rand',
+ min_area_ratio=0.02,
+ max_area_ratio=1 / 3,
+ fill_color=[103.53, 116.28, 123.675],
+ fill_std=[57.375, 57.12, 58.395]),
+ dict(type='PackInputs'),
+]
+
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='ResizeEdge',
+ scale=219,
+ edge='short',
+ backend='pillow',
+ interpolation='bicubic'),
+ dict(type='CenterCrop', crop_size=192),
+ dict(type='PackInputs'),
+]
+
+train_dataloader = dict(
+ batch_size=256,
+ num_workers=8,
+ collate_fn=dict(type='default_collate'),
+ persistent_workers=True,
+ sampler=dict(type='DefaultSampler', shuffle=True),
+ dataset=dict(
+ type=dataset_type,
+ data_root=data_root,
+ split='train',
+ pipeline=train_pipeline),
+)
+
+val_dataloader = dict(
+ batch_size=64,
+ num_workers=5,
+ collate_fn=dict(type='default_collate'),
+ persistent_workers=True,
+ sampler=dict(type='DefaultSampler', shuffle=False),
+ dataset=dict(
+ type=dataset_type,
+ data_root=data_root,
+ split='val',
+ pipeline=test_pipeline),
+)
+val_evaluator = dict(type='Accuracy', topk=(1, 5))
+
+# If you want standard test, please manually configure the test dataset
+test_dataloader = val_dataloader
+test_evaluator = val_evaluator
diff --git a/configs/_base_/datasets/imagenet_bs32.py b/configs/_base_/datasets/imagenet_bs32.py
new file mode 100644
index 0000000000000000000000000000000000000000..a069bb9c3317079e2d7cdec8c8573ad0c7d42470
--- /dev/null
+++ b/configs/_base_/datasets/imagenet_bs32.py
@@ -0,0 +1,51 @@
+# dataset settings
+dataset_type = 'ImageNet'
+data_preprocessor = dict(
+ num_classes=1000,
+ # RGB format normalization parameters
+ mean=[123.675, 116.28, 103.53],
+ std=[58.395, 57.12, 57.375],
+ # convert image from BGR to RGB
+ to_rgb=True,
+)
+
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='RandomResizedCrop', scale=224),
+ dict(type='RandomFlip', prob=0.5, direction='horizontal'),
+ dict(type='PackInputs'),
+]
+
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='ResizeEdge', scale=256, edge='short'),
+ dict(type='CenterCrop', crop_size=224),
+ dict(type='PackInputs'),
+]
+
+train_dataloader = dict(
+ batch_size=32,
+ num_workers=5,
+ dataset=dict(
+ type=dataset_type,
+ data_root='data/imagenet',
+ split='train',
+ pipeline=train_pipeline),
+ sampler=dict(type='DefaultSampler', shuffle=True),
+)
+
+val_dataloader = dict(
+ batch_size=32,
+ num_workers=5,
+ dataset=dict(
+ type=dataset_type,
+ data_root='data/imagenet',
+ split='val',
+ pipeline=test_pipeline),
+ sampler=dict(type='DefaultSampler', shuffle=False),
+)
+val_evaluator = dict(type='Accuracy', topk=(1, 5))
+
+# If you want standard test, please manually configure the test dataset
+test_dataloader = val_dataloader
+test_evaluator = val_evaluator
diff --git a/configs/_base_/datasets/imagenet_bs32_byol.py b/configs/_base_/datasets/imagenet_bs32_byol.py
new file mode 100644
index 0000000000000000000000000000000000000000..a7235b3be6fbfb79bcdc7179aef0bcd906475a68
--- /dev/null
+++ b/configs/_base_/datasets/imagenet_bs32_byol.py
@@ -0,0 +1,89 @@
+# dataset settings
+dataset_type = 'ImageNet'
+data_root = 'data/imagenet/'
+data_preprocessor = dict(
+ type='SelfSupDataPreprocessor',
+ mean=[123.675, 116.28, 103.53],
+ std=[58.395, 57.12, 57.375],
+ to_rgb=True)
+
+view_pipeline1 = [
+ dict(
+ type='RandomResizedCrop',
+ scale=224,
+ interpolation='bicubic',
+ backend='pillow'),
+ dict(type='RandomFlip', prob=0.5),
+ dict(
+ type='RandomApply',
+ transforms=[
+ dict(
+ type='ColorJitter',
+ brightness=0.4,
+ contrast=0.4,
+ saturation=0.2,
+ hue=0.1)
+ ],
+ prob=0.8),
+ dict(
+ type='RandomGrayscale',
+ prob=0.2,
+ keep_channels=True,
+ channel_weights=(0.114, 0.587, 0.2989)),
+ dict(
+ type='GaussianBlur',
+ magnitude_range=(0.1, 2.0),
+ magnitude_std='inf',
+ prob=1.),
+ dict(type='Solarize', thr=128, prob=0.),
+]
+view_pipeline2 = [
+ dict(
+ type='RandomResizedCrop',
+ scale=224,
+ interpolation='bicubic',
+ backend='pillow'),
+ dict(type='RandomFlip', prob=0.5),
+ dict(
+ type='RandomApply',
+ transforms=[
+ dict(
+ type='ColorJitter',
+ brightness=0.4,
+ contrast=0.4,
+ saturation=0.2,
+ hue=0.1)
+ ],
+ prob=0.8),
+ dict(
+ type='RandomGrayscale',
+ prob=0.2,
+ keep_channels=True,
+ channel_weights=(0.114, 0.587, 0.2989)),
+ dict(
+ type='GaussianBlur',
+ magnitude_range=(0.1, 2.0),
+ magnitude_std='inf',
+ prob=0.1),
+ dict(type='Solarize', thr=128, prob=0.2)
+]
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='MultiView',
+ num_views=[1, 1],
+ transforms=[view_pipeline1, view_pipeline2]),
+ dict(type='PackInputs')
+]
+
+train_dataloader = dict(
+ batch_size=32,
+ num_workers=4,
+ persistent_workers=True,
+ sampler=dict(type='DefaultSampler', shuffle=True),
+ collate_fn=dict(type='default_collate'),
+ dataset=dict(
+ type=dataset_type,
+ data_root=data_root,
+ split='train',
+ pipeline=train_pipeline))
diff --git a/configs/_base_/datasets/imagenet_bs32_mocov2.py b/configs/_base_/datasets/imagenet_bs32_mocov2.py
new file mode 100644
index 0000000000000000000000000000000000000000..dc60050dc748f3f28e0b68c83a1fd0910503039b
--- /dev/null
+++ b/configs/_base_/datasets/imagenet_bs32_mocov2.py
@@ -0,0 +1,58 @@
+# dataset settings
+dataset_type = 'ImageNet'
+data_root = 'data/imagenet/'
+data_preprocessor = dict(
+ type='SelfSupDataPreprocessor',
+ mean=[123.675, 116.28, 103.53],
+ std=[58.395, 57.12, 57.375],
+ to_rgb=True)
+
+# The difference between mocov2 and mocov1 is the transforms in the pipeline
+view_pipeline = [
+ dict(
+ type='RandomResizedCrop',
+ scale=224,
+ crop_ratio_range=(0.2, 1.),
+ backend='pillow'),
+ dict(
+ type='RandomApply',
+ transforms=[
+ dict(
+ type='ColorJitter',
+ brightness=0.4,
+ contrast=0.4,
+ saturation=0.4,
+ hue=0.1)
+ ],
+ prob=0.8),
+ dict(
+ type='RandomGrayscale',
+ prob=0.2,
+ keep_channels=True,
+ channel_weights=(0.114, 0.587, 0.2989)),
+ dict(
+ type='GaussianBlur',
+ magnitude_range=(0.1, 2.0),
+ magnitude_std='inf',
+ prob=0.5),
+ dict(type='RandomFlip', prob=0.5),
+]
+
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='MultiView', num_views=2, transforms=[view_pipeline]),
+ dict(type='PackInputs')
+]
+
+train_dataloader = dict(
+ batch_size=32,
+ num_workers=8,
+ drop_last=True,
+ persistent_workers=True,
+ sampler=dict(type='DefaultSampler', shuffle=True),
+ collate_fn=dict(type='default_collate'),
+ dataset=dict(
+ type=dataset_type,
+ data_root=data_root,
+ split='train',
+ pipeline=train_pipeline))
diff --git a/configs/_base_/datasets/imagenet_bs32_pil_bicubic.py b/configs/_base_/datasets/imagenet_bs32_pil_bicubic.py
new file mode 100644
index 0000000000000000000000000000000000000000..36880ff76abd2329199801f807ec3bb0469ec140
--- /dev/null
+++ b/configs/_base_/datasets/imagenet_bs32_pil_bicubic.py
@@ -0,0 +1,60 @@
+# dataset settings
+dataset_type = 'ImageNet'
+data_preprocessor = dict(
+ num_classes=1000,
+ # RGB format normalization parameters
+ mean=[123.675, 116.28, 103.53],
+ std=[58.395, 57.12, 57.375],
+ # convert image from BGR to RGB
+ to_rgb=True,
+)
+
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='RandomResizedCrop',
+ scale=224,
+ backend='pillow',
+ interpolation='bicubic'),
+ dict(type='RandomFlip', prob=0.5, direction='horizontal'),
+ dict(type='PackInputs'),
+]
+
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='ResizeEdge',
+ scale=256,
+ edge='short',
+ backend='pillow',
+ interpolation='bicubic'),
+ dict(type='CenterCrop', crop_size=224),
+ dict(type='PackInputs'),
+]
+
+train_dataloader = dict(
+ batch_size=32,
+ num_workers=5,
+ dataset=dict(
+ type=dataset_type,
+ data_root='data/imagenet',
+ split='train',
+ pipeline=train_pipeline),
+ sampler=dict(type='DefaultSampler', shuffle=True),
+)
+
+val_dataloader = dict(
+ batch_size=32,
+ num_workers=5,
+ dataset=dict(
+ type=dataset_type,
+ data_root='data/imagenet',
+ split='val',
+ pipeline=test_pipeline),
+ sampler=dict(type='DefaultSampler', shuffle=False),
+)
+val_evaluator = dict(type='Accuracy', topk=(1, 5))
+
+# If you want standard test, please manually configure the test dataset
+test_dataloader = val_dataloader
+test_evaluator = val_evaluator
diff --git a/configs/_base_/datasets/imagenet_bs32_pil_resize.py b/configs/_base_/datasets/imagenet_bs32_pil_resize.py
new file mode 100644
index 0000000000000000000000000000000000000000..f9afc5cb0ed9fa7941b17fdfdae792b54adc9608
--- /dev/null
+++ b/configs/_base_/datasets/imagenet_bs32_pil_resize.py
@@ -0,0 +1,51 @@
+# dataset settings
+dataset_type = 'ImageNet'
+data_preprocessor = dict(
+ num_classes=1000,
+ # RGB format normalization parameters
+ mean=[123.675, 116.28, 103.53],
+ std=[58.395, 57.12, 57.375],
+ # convert image from BGR to RGB
+ to_rgb=True,
+)
+
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='RandomResizedCrop', scale=224, backend='pillow'),
+ dict(type='RandomFlip', prob=0.5, direction='horizontal'),
+ dict(type='PackInputs'),
+]
+
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='ResizeEdge', scale=256, edge='short', backend='pillow'),
+ dict(type='CenterCrop', crop_size=224),
+ dict(type='PackInputs'),
+]
+
+train_dataloader = dict(
+ batch_size=32,
+ num_workers=5,
+ dataset=dict(
+ type=dataset_type,
+ data_root='data/imagenet',
+ split='train',
+ pipeline=train_pipeline),
+ sampler=dict(type='DefaultSampler', shuffle=True),
+)
+
+val_dataloader = dict(
+ batch_size=32,
+ num_workers=5,
+ dataset=dict(
+ type=dataset_type,
+ data_root='data/imagenet',
+ split='val',
+ pipeline=test_pipeline),
+ sampler=dict(type='DefaultSampler', shuffle=False),
+)
+val_evaluator = dict(type='Accuracy', topk=(1, 5))
+
+# If you want standard test, please manually configure the test dataset
+test_dataloader = val_dataloader
+test_evaluator = val_evaluator
diff --git a/configs/_base_/datasets/imagenet_bs32_simclr.py b/configs/_base_/datasets/imagenet_bs32_simclr.py
new file mode 100644
index 0000000000000000000000000000000000000000..8e487b00b164eb964cfb4159a6918eb55d2b404e
--- /dev/null
+++ b/configs/_base_/datasets/imagenet_bs32_simclr.py
@@ -0,0 +1,52 @@
+# dataset settings
+dataset_type = 'ImageNet'
+data_root = 'data/imagenet/'
+data_preprocessor = dict(
+ type='SelfSupDataPreprocessor',
+ mean=[123.675, 116.28, 103.53],
+ std=[58.395, 57.12, 57.375],
+ to_rgb=True)
+
+view_pipeline = [
+ dict(type='RandomResizedCrop', scale=224, backend='pillow'),
+ dict(type='RandomFlip', prob=0.5),
+ dict(
+ type='RandomApply',
+ transforms=[
+ dict(
+ type='ColorJitter',
+ brightness=0.8,
+ contrast=0.8,
+ saturation=0.8,
+ hue=0.2)
+ ],
+ prob=0.8),
+ dict(
+ type='RandomGrayscale',
+ prob=0.2,
+ keep_channels=True,
+ channel_weights=(0.114, 0.587, 0.2989)),
+ dict(
+ type='GaussianBlur',
+ magnitude_range=(0.1, 2.0),
+ magnitude_std='inf',
+ prob=0.5),
+]
+
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='MultiView', num_views=2, transforms=[view_pipeline]),
+ dict(type='PackInputs')
+]
+
+train_dataloader = dict(
+ batch_size=32,
+ num_workers=4,
+ persistent_workers=True,
+ sampler=dict(type='DefaultSampler', shuffle=True),
+ collate_fn=dict(type='default_collate'),
+ dataset=dict(
+ type=dataset_type,
+ data_root=data_root,
+ split='train',
+ pipeline=train_pipeline))
diff --git a/configs/_base_/datasets/imagenet_bs512_mae.py b/configs/_base_/datasets/imagenet_bs512_mae.py
new file mode 100644
index 0000000000000000000000000000000000000000..03d350eb0024a872e53f7d95ab7f3f12c4e70a25
--- /dev/null
+++ b/configs/_base_/datasets/imagenet_bs512_mae.py
@@ -0,0 +1,32 @@
+# dataset settings
+dataset_type = 'ImageNet'
+data_root = 'data/imagenet/'
+data_preprocessor = dict(
+ type='SelfSupDataPreprocessor',
+ mean=[123.675, 116.28, 103.53],
+ std=[58.395, 57.12, 57.375],
+ to_rgb=True)
+
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='RandomResizedCrop',
+ scale=224,
+ crop_ratio_range=(0.2, 1.0),
+ backend='pillow',
+ interpolation='bicubic'),
+ dict(type='RandomFlip', prob=0.5),
+ dict(type='PackInputs')
+]
+
+train_dataloader = dict(
+ batch_size=512,
+ num_workers=8,
+ persistent_workers=True,
+ sampler=dict(type='DefaultSampler', shuffle=True),
+ collate_fn=dict(type='default_collate'),
+ dataset=dict(
+ type=dataset_type,
+ data_root=data_root,
+ split='train',
+ pipeline=train_pipeline))
diff --git a/configs/_base_/datasets/imagenet_bs512_mocov3.py b/configs/_base_/datasets/imagenet_bs512_mocov3.py
new file mode 100644
index 0000000000000000000000000000000000000000..1679f636e316a229744d8d79b8cda5c92e2b1450
--- /dev/null
+++ b/configs/_base_/datasets/imagenet_bs512_mocov3.py
@@ -0,0 +1,90 @@
+# dataset settings
+dataset_type = 'ImageNet'
+data_root = 'data/imagenet/'
+data_preprocessor = dict(
+ type='SelfSupDataPreprocessor',
+ mean=[123.675, 116.28, 103.53],
+ std=[58.395, 57.12, 57.375],
+ to_rgb=True)
+
+view_pipeline1 = [
+ dict(
+ type='RandomResizedCrop',
+ scale=224,
+ crop_ratio_range=(0.2, 1.),
+ backend='pillow'),
+ dict(
+ type='RandomApply',
+ transforms=[
+ dict(
+ type='ColorJitter',
+ brightness=0.4,
+ contrast=0.4,
+ saturation=0.2,
+ hue=0.1)
+ ],
+ prob=0.8),
+ dict(
+ type='RandomGrayscale',
+ prob=0.2,
+ keep_channels=True,
+ channel_weights=(0.114, 0.587, 0.2989)),
+ dict(
+ type='GaussianBlur',
+ magnitude_range=(0.1, 2.0),
+ magnitude_std='inf',
+ prob=1.),
+ dict(type='Solarize', thr=128, prob=0.),
+ dict(type='RandomFlip', prob=0.5),
+]
+view_pipeline2 = [
+ dict(
+ type='RandomResizedCrop',
+ scale=224,
+ crop_ratio_range=(0.2, 1.),
+ backend='pillow'),
+ dict(
+ type='RandomApply',
+ transforms=[
+ dict(
+ type='ColorJitter',
+ brightness=0.4,
+ contrast=0.4,
+ saturation=0.2,
+ hue=0.1)
+ ],
+ prob=0.8),
+ dict(
+ type='RandomGrayscale',
+ prob=0.2,
+ keep_channels=True,
+ channel_weights=(0.114, 0.587, 0.2989)),
+ dict(
+ type='GaussianBlur',
+ magnitude_range=(0.1, 2.0),
+ magnitude_std='inf',
+ prob=0.1),
+ dict(type='Solarize', thr=128, prob=0.2),
+ dict(type='RandomFlip', prob=0.5),
+]
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='MultiView',
+ num_views=[1, 1],
+ transforms=[view_pipeline1, view_pipeline2]),
+ dict(type='PackInputs')
+]
+
+train_dataloader = dict(
+ batch_size=512,
+ num_workers=8,
+ persistent_workers=True,
+ pin_memory=True,
+ sampler=dict(type='DefaultSampler', shuffle=True),
+ collate_fn=dict(type='default_collate'),
+ dataset=dict(
+ type=dataset_type,
+ data_root=data_root,
+ split='train',
+ pipeline=train_pipeline))
diff --git a/configs/_base_/datasets/imagenet_bs64.py b/configs/_base_/datasets/imagenet_bs64.py
new file mode 100644
index 0000000000000000000000000000000000000000..73e6d54bdde5523604dca93a8731765b4def92db
--- /dev/null
+++ b/configs/_base_/datasets/imagenet_bs64.py
@@ -0,0 +1,51 @@
+# dataset settings
+dataset_type = 'ImageNet'
+data_preprocessor = dict(
+ num_classes=1000,
+ # RGB format normalization parameters
+ mean=[123.675, 116.28, 103.53],
+ std=[58.395, 57.12, 57.375],
+ # convert image from BGR to RGB
+ to_rgb=True,
+)
+
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='RandomResizedCrop', scale=224),
+ dict(type='RandomFlip', prob=0.5, direction='horizontal'),
+ dict(type='PackInputs'),
+]
+
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='ResizeEdge', scale=256, edge='short'),
+ dict(type='CenterCrop', crop_size=224),
+ dict(type='PackInputs'),
+]
+
+train_dataloader = dict(
+ batch_size=64,
+ num_workers=5,
+ dataset=dict(
+ type=dataset_type,
+ data_root='data/imagenet',
+ split='train',
+ pipeline=train_pipeline),
+ sampler=dict(type='DefaultSampler', shuffle=True),
+)
+
+val_dataloader = dict(
+ batch_size=64,
+ num_workers=5,
+ dataset=dict(
+ type=dataset_type,
+ data_root='data/imagenet',
+ split='val',
+ pipeline=test_pipeline),
+ sampler=dict(type='DefaultSampler', shuffle=False),
+)
+val_evaluator = dict(type='Accuracy', topk=(1, 5))
+
+# If you want standard test, please manually configure the test dataset
+test_dataloader = val_dataloader
+test_evaluator = val_evaluator
diff --git a/configs/_base_/datasets/imagenet_bs64_autoaug.py b/configs/_base_/datasets/imagenet_bs64_autoaug.py
new file mode 100644
index 0000000000000000000000000000000000000000..3160b8cf2afaa05cd49e09cabade7f4716bbd23d
--- /dev/null
+++ b/configs/_base_/datasets/imagenet_bs64_autoaug.py
@@ -0,0 +1,59 @@
+# dataset settings
+dataset_type = 'ImageNet'
+data_preprocessor = dict(
+ num_classes=1000,
+ # RGB format normalization parameters
+ mean=[123.675, 116.28, 103.53],
+ std=[58.395, 57.12, 57.375],
+ # convert image from BGR to RGB
+ to_rgb=True,
+)
+
+bgr_mean = data_preprocessor['mean'][::-1]
+bgr_std = data_preprocessor['std'][::-1]
+
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='RandomResizedCrop', scale=224),
+ dict(type='RandomFlip', prob=0.5, direction='horizontal'),
+ dict(
+ type='AutoAugment',
+ policies='imagenet',
+ hparams=dict(
+ pad_val=[round(x) for x in bgr_mean], interpolation='bicubic')),
+ dict(type='PackInputs'),
+]
+
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='ResizeEdge', scale=256, edge='short'),
+ dict(type='CenterCrop', crop_size=224),
+ dict(type='PackInputs'),
+]
+
+train_dataloader = dict(
+ batch_size=64,
+ num_workers=5,
+ dataset=dict(
+ type=dataset_type,
+ data_root='data/imagenet',
+ split='train',
+ pipeline=train_pipeline),
+ sampler=dict(type='DefaultSampler', shuffle=True),
+)
+
+val_dataloader = dict(
+ batch_size=64,
+ num_workers=5,
+ dataset=dict(
+ type=dataset_type,
+ data_root='data/imagenet',
+ split='val',
+ pipeline=test_pipeline),
+ sampler=dict(type='DefaultSampler', shuffle=False),
+)
+val_evaluator = dict(type='Accuracy', topk=(1, 5))
+
+# If you want a standard test, please manually configure the test dataset
+test_dataloader = val_dataloader
+test_evaluator = val_evaluator
diff --git a/configs/_base_/datasets/imagenet_bs64_clip_224.py b/configs/_base_/datasets/imagenet_bs64_clip_224.py
new file mode 100644
index 0000000000000000000000000000000000000000..c200601ba45e7a1f317803e7c6f8c0ba34355623
--- /dev/null
+++ b/configs/_base_/datasets/imagenet_bs64_clip_224.py
@@ -0,0 +1,73 @@
+# dataset settings
+dataset_type = 'ImageNet'
+img_norm_cfg = dict(
+ mean=[0.48145466 * 255, 0.4578275 * 255, 0.40821073 * 255],
+ std=[0.26862954 * 255, 0.26130258 * 255, 0.27577711 * 255],
+ to_rgb=True)
+image_size = 224
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='RandomResizedCrop',
+ size=image_size,
+ backend='pillow',
+ interpolation='bicubic'),
+ dict(type='RandomFlip', flip_prob=0.5, direction='horizontal'),
+ # dict(
+ # type='RandAugment',
+ # policies={{_base_.rand_increasing_policies}},
+ # num_policies=2,
+ # total_level=10,
+ # magnitude_level=9,
+ # magnitude_std=0.5,
+ # hparams=dict(
+ # pad_val=[round(x) for x in img_norm_cfg['mean'][::-1]],
+ # interpolation='bicubic')),
+ dict(
+ type='RandomErasing',
+ erase_prob=0.25,
+ mode='rand',
+ min_area_ratio=0.02,
+ max_area_ratio=1 / 3,
+ fill_color=img_norm_cfg['mean'][::-1],
+ fill_std=img_norm_cfg['std'][::-1]),
+ dict(type='Normalize', **img_norm_cfg),
+ dict(type='ImageToTensor', keys=['img']),
+ dict(type='ToTensor', keys=['gt_label']),
+ dict(type='Collect', keys=['img', 'gt_label'])
+]
+
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='Resize',
+ size=(image_size, -1),
+ backend='pillow',
+ interpolation='bicubic'),
+ dict(type='CenterCrop', crop_size=image_size),
+ dict(type='Normalize', **img_norm_cfg),
+ dict(type='ImageToTensor', keys=['img']),
+ dict(type='Collect', keys=['img'])
+]
+
+data = dict(
+ samples_per_gpu=64,
+ workers_per_gpu=8,
+ train=dict(
+ type=dataset_type,
+ data_root='data/imagenet',
+ split='train',
+ pipeline=train_pipeline),
+ val=dict(
+ type=dataset_type,
+ data_root='data/imagenet',
+ split='val',
+ pipeline=test_pipeline),
+ test=dict(
+ # use split='test' instead of 'val' for the standard test set
+ type=dataset_type,
+ data_root='data/imagenet',
+ split='val',
+ pipeline=test_pipeline))
+
+evaluation = dict(interval=10, metric='accuracy')
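Unlike the other dataset files in this diff, the three `imagenet_bs64_clip_*` configs keep the legacy MMClassification 0.x layout (`img_norm_cfg`, `Normalize`/`ImageToTensor`/`Collect` transforms and a `data=dict(samples_per_gpu=...)` block) rather than the `data_preprocessor`/`train_dataloader` schema used everywhere else. If they were migrated later, the normalization would move into a `data_preprocessor`, roughly as sketched below; this is only an illustration of the mapping, not a change made by this PR.

```python
# Sketch: the CLIP normalization above expressed in the new-style schema
# used by the surrounding dataset configs (values unchanged).
data_preprocessor = dict(
    num_classes=1000,
    mean=[0.48145466 * 255, 0.4578275 * 255, 0.40821073 * 255],
    std=[0.26862954 * 255, 0.26130258 * 255, 0.27577711 * 255],
    to_rgb=True,
)
```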
diff --git a/configs/_base_/datasets/imagenet_bs64_clip_384.py b/configs/_base_/datasets/imagenet_bs64_clip_384.py
new file mode 100644
index 0000000000000000000000000000000000000000..a7caee678774a3baa1481163fe89fe35ee5e9b96
--- /dev/null
+++ b/configs/_base_/datasets/imagenet_bs64_clip_384.py
@@ -0,0 +1,73 @@
+# dataset settings
+dataset_type = 'ImageNet'
+img_norm_cfg = dict(
+ mean=[0.48145466 * 255, 0.4578275 * 255, 0.40821073 * 255],
+ std=[0.26862954 * 255, 0.26130258 * 255, 0.27577711 * 255],
+ to_rgb=True)
+image_size = 384
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='RandomResizedCrop',
+ size=image_size,
+ backend='pillow',
+ interpolation='bicubic'),
+ dict(type='RandomFlip', flip_prob=0.5, direction='horizontal'),
+ # dict(
+ # type='RandAugment',
+ # policies={{_base_.rand_increasing_policies}},
+ # num_policies=2,
+ # total_level=10,
+ # magnitude_level=9,
+ # magnitude_std=0.5,
+ # hparams=dict(
+ # pad_val=[round(x) for x in img_norm_cfg['mean'][::-1]],
+ # interpolation='bicubic')),
+ dict(
+ type='RandomErasing',
+ erase_prob=0.25,
+ mode='rand',
+ min_area_ratio=0.02,
+ max_area_ratio=1 / 3,
+ fill_color=img_norm_cfg['mean'][::-1],
+ fill_std=img_norm_cfg['std'][::-1]),
+ dict(type='Normalize', **img_norm_cfg),
+ dict(type='ImageToTensor', keys=['img']),
+ dict(type='ToTensor', keys=['gt_label']),
+ dict(type='Collect', keys=['img', 'gt_label'])
+]
+
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='Resize',
+ size=(image_size, -1),
+ backend='pillow',
+ interpolation='bicubic'),
+ dict(type='CenterCrop', crop_size=image_size),
+ dict(type='Normalize', **img_norm_cfg),
+ dict(type='ImageToTensor', keys=['img']),
+ dict(type='Collect', keys=['img'])
+]
+
+data = dict(
+ samples_per_gpu=64,
+ workers_per_gpu=8,
+ train=dict(
+ type=dataset_type,
+ data_root='data/imagenet',
+ split='train',
+ pipeline=train_pipeline),
+ val=dict(
+ type=dataset_type,
+ data_root='data/imagenet',
+ split='val',
+ pipeline=test_pipeline),
+ test=dict(
+ # use split='test' instead of 'val' for the standard test set
+ type=dataset_type,
+ data_root='data/imagenet',
+ split='val',
+ pipeline=test_pipeline))
+
+evaluation = dict(interval=10, metric='accuracy')
diff --git a/configs/_base_/datasets/imagenet_bs64_clip_448.py b/configs/_base_/datasets/imagenet_bs64_clip_448.py
new file mode 100644
index 0000000000000000000000000000000000000000..32a92ef66a30d6caff7d399fb321ec9283965920
--- /dev/null
+++ b/configs/_base_/datasets/imagenet_bs64_clip_448.py
@@ -0,0 +1,74 @@
+# dataset settings
+dataset_type = 'ImageNet'
+img_norm_cfg = dict(
+ mean=[0.48145466 * 255, 0.4578275 * 255, 0.40821073 * 255],
+ std=[0.26862954 * 255, 0.26130258 * 255, 0.27577711 * 255],
+ to_rgb=True)
+image_size = 448
+
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='RandomResizedCrop',
+ size=image_size,
+ backend='pillow',
+ interpolation='bicubic'),
+ dict(type='RandomFlip', flip_prob=0.5, direction='horizontal'),
+ # dict(
+ # type='RandAugment',
+ # policies={{_base_.rand_increasing_policies}},
+ # num_policies=2,
+ # total_level=10,
+ # magnitude_level=9,
+ # magnitude_std=0.5,
+ # hparams=dict(
+ # pad_val=[round(x) for x in img_norm_cfg['mean'][::-1]],
+ # interpolation='bicubic')),
+ dict(
+ type='RandomErasing',
+ erase_prob=0.25,
+ mode='rand',
+ min_area_ratio=0.02,
+ max_area_ratio=1 / 3,
+ fill_color=img_norm_cfg['mean'][::-1],
+ fill_std=img_norm_cfg['std'][::-1]),
+ dict(type='Normalize', **img_norm_cfg),
+ dict(type='ImageToTensor', keys=['img']),
+ dict(type='ToTensor', keys=['gt_label']),
+ dict(type='Collect', keys=['img', 'gt_label'])
+]
+
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='Resize',
+ size=(image_size, -1),
+ backend='pillow',
+ interpolation='bicubic'),
+ dict(type='CenterCrop', crop_size=image_size),
+ dict(type='Normalize', **img_norm_cfg),
+ dict(type='ImageToTensor', keys=['img']),
+ dict(type='Collect', keys=['img'])
+]
+
+data = dict(
+ samples_per_gpu=64,
+ workers_per_gpu=8,
+ train=dict(
+ type=dataset_type,
+ data_root='data/imagenet',
+ split='train',
+ pipeline=train_pipeline),
+ val=dict(
+ type=dataset_type,
+ data_root='data/imagenet',
+ split='val',
+ pipeline=test_pipeline),
+ test=dict(
+ # use split='test' instead of 'val' for the standard test set
+ type=dataset_type,
+ data_root='data/imagenet',
+ split='val',
+ pipeline=test_pipeline))
+
+evaluation = dict(interval=10, metric='accuracy')
diff --git a/configs/_base_/datasets/imagenet_bs64_convmixer_224.py b/configs/_base_/datasets/imagenet_bs64_convmixer_224.py
new file mode 100644
index 0000000000000000000000000000000000000000..7e9c0aa0f9bfc8883f3ee5d58464c8ea97f5e3bc
--- /dev/null
+++ b/configs/_base_/datasets/imagenet_bs64_convmixer_224.py
@@ -0,0 +1,80 @@
+# dataset settings
+dataset_type = 'ImageNet'
+data_preprocessor = dict(
+ num_classes=1000,
+ # RGB format normalization parameters
+ mean=[123.675, 116.28, 103.53],
+ std=[58.395, 57.12, 57.375],
+ # convert image from BGR to RGB
+ to_rgb=True,
+)
+
+bgr_mean = data_preprocessor['mean'][::-1]
+bgr_std = data_preprocessor['std'][::-1]
+
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='RandomResizedCrop',
+ scale=224,
+ backend='pillow',
+ interpolation='bicubic'),
+ dict(type='RandomFlip', prob=0.5, direction='horizontal'),
+ dict(
+ type='RandAugment',
+ policies='timm_increasing',
+ num_policies=2,
+ total_level=10,
+ magnitude_level=9,
+ magnitude_std=0.5,
+ hparams=dict(
+ pad_val=[round(x) for x in bgr_mean], interpolation='bicubic')),
+ dict(
+ type='RandomErasing',
+ erase_prob=0.25,
+ mode='rand',
+ min_area_ratio=0.02,
+ max_area_ratio=1 / 3,
+ fill_color=bgr_mean,
+ fill_std=bgr_std),
+ dict(type='PackInputs')
+]
+
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='ResizeEdge',
+ scale=233,
+ edge='short',
+ backend='pillow',
+ interpolation='bicubic'),
+ dict(type='CenterCrop', crop_size=224),
+ dict(type='PackInputs')
+]
+
+train_dataloader = dict(
+ batch_size=64,
+ num_workers=5,
+ dataset=dict(
+ type=dataset_type,
+ data_root='data/imagenet',
+ split='train',
+ pipeline=train_pipeline),
+ sampler=dict(type='DefaultSampler', shuffle=True),
+)
+
+val_dataloader = dict(
+ batch_size=64,
+ num_workers=5,
+ dataset=dict(
+ type=dataset_type,
+ data_root='data/imagenet',
+ split='val',
+ pipeline=test_pipeline),
+ sampler=dict(type='DefaultSampler', shuffle=False),
+)
+val_evaluator = dict(type='Accuracy', topk=(1, 5))
+
+# If you want a standard test, please manually configure the test dataset
+test_dataloader = val_dataloader
+test_evaluator = val_evaluator
diff --git a/configs/_base_/datasets/imagenet_bs64_deit3_224.py b/configs/_base_/datasets/imagenet_bs64_deit3_224.py
new file mode 100644
index 0000000000000000000000000000000000000000..5e460a4d95a21d2ca3c3d6bb0d65e5c5409c14ff
--- /dev/null
+++ b/configs/_base_/datasets/imagenet_bs64_deit3_224.py
@@ -0,0 +1,80 @@
+# dataset settings
+dataset_type = 'ImageNet'
+data_preprocessor = dict(
+ num_classes=1000,
+ # RGB format normalization parameters
+ mean=[123.675, 116.28, 103.53],
+ std=[58.395, 57.12, 57.375],
+ # convert image from BGR to RGB
+ to_rgb=True,
+)
+
+bgr_mean = data_preprocessor['mean'][::-1]
+bgr_std = data_preprocessor['std'][::-1]
+
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='RandomResizedCrop',
+ scale=224,
+ backend='pillow',
+ interpolation='bicubic'),
+ dict(type='RandomFlip', prob=0.5, direction='horizontal'),
+ dict(
+ type='RandAugment',
+ policies='timm_increasing',
+ num_policies=2,
+ total_level=10,
+ magnitude_level=9,
+ magnitude_std=0.5,
+ hparams=dict(
+ pad_val=[round(x) for x in bgr_mean], interpolation='bicubic')),
+ dict(
+ type='RandomErasing',
+ erase_prob=0.25,
+ mode='rand',
+ min_area_ratio=0.02,
+ max_area_ratio=1 / 3,
+ fill_color=bgr_mean,
+ fill_std=bgr_std),
+ dict(type='PackInputs'),
+]
+
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='ResizeEdge',
+ scale=224,
+ edge='short',
+ backend='pillow',
+ interpolation='bicubic'),
+ dict(type='CenterCrop', crop_size=224),
+ dict(type='PackInputs'),
+]
+
+train_dataloader = dict(
+ batch_size=64,
+ num_workers=5,
+ dataset=dict(
+ type=dataset_type,
+ data_root='data/imagenet',
+ split='train',
+ pipeline=train_pipeline),
+ sampler=dict(type='DefaultSampler', shuffle=True),
+)
+
+val_dataloader = dict(
+ batch_size=64,
+ num_workers=5,
+ dataset=dict(
+ type=dataset_type,
+ data_root='data/imagenet',
+ split='val',
+ pipeline=test_pipeline),
+ sampler=dict(type='DefaultSampler', shuffle=False),
+)
+val_evaluator = dict(type='Accuracy', topk=(1, 5))
+
+# If you want a standard test, please manually configure the test dataset
+test_dataloader = val_dataloader
+test_evaluator = val_evaluator
diff --git a/configs/_base_/datasets/imagenet_bs64_deit3_384.py b/configs/_base_/datasets/imagenet_bs64_deit3_384.py
new file mode 100644
index 0000000000000000000000000000000000000000..bc554ddba1d6a32a83638e7c2d58d27c345a4909
--- /dev/null
+++ b/configs/_base_/datasets/imagenet_bs64_deit3_384.py
@@ -0,0 +1,60 @@
+# dataset settings
+dataset_type = 'ImageNet'
+data_preprocessor = dict(
+ num_classes=1000,
+ # RGB format normalization parameters
+ mean=[123.675, 116.28, 103.53],
+ std=[58.395, 57.12, 57.375],
+ # convert image from BGR to RGB
+ to_rgb=True,
+)
+
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='RandomResizedCrop',
+ scale=384,
+ backend='pillow',
+ interpolation='bicubic'),
+ dict(type='RandomFlip', prob=0.5, direction='horizontal'),
+ dict(type='PackInputs'),
+]
+
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='ResizeEdge',
+ scale=384,
+ edge='short',
+ backend='pillow',
+ interpolation='bicubic'),
+ dict(type='CenterCrop', crop_size=384),
+ dict(type='PackInputs'),
+]
+
+train_dataloader = dict(
+ batch_size=64,
+ num_workers=5,
+ dataset=dict(
+ type=dataset_type,
+ data_root='data/imagenet',
+ split='train',
+ pipeline=train_pipeline),
+ sampler=dict(type='DefaultSampler', shuffle=True),
+)
+
+val_dataloader = dict(
+ batch_size=64,
+ num_workers=5,
+ dataset=dict(
+ type=dataset_type,
+ data_root='data/imagenet',
+ split='val',
+ pipeline=test_pipeline),
+ sampler=dict(type='DefaultSampler', shuffle=False),
+)
+val_evaluator = dict(type='Accuracy', topk=(1, 5))
+
+# If you want a standard test, please manually configure the test dataset
+test_dataloader = val_dataloader
+test_evaluator = val_evaluator
diff --git a/configs/_base_/datasets/imagenet_bs64_edgenext_256.py b/configs/_base_/datasets/imagenet_bs64_edgenext_256.py
new file mode 100644
index 0000000000000000000000000000000000000000..7db9e4ef5f26691e364d244df0729827bf356293
--- /dev/null
+++ b/configs/_base_/datasets/imagenet_bs64_edgenext_256.py
@@ -0,0 +1,80 @@
+# dataset settings
+dataset_type = 'ImageNet'
+data_preprocessor = dict(
+ num_classes=1000,
+ # RGB format normalization parameters
+ mean=[123.675, 116.28, 103.53],
+ std=[58.395, 57.12, 57.375],
+ # convert image from BGR to RGB
+ to_rgb=True,
+)
+
+bgr_mean = data_preprocessor['mean'][::-1]
+bgr_std = data_preprocessor['std'][::-1]
+
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='RandomResizedCrop',
+ scale=256,
+ backend='pillow',
+ interpolation='bicubic'),
+ dict(type='RandomFlip', prob=0.5, direction='horizontal'),
+ dict(
+ type='RandAugment',
+ policies='timm_increasing',
+ num_policies=2,
+ total_level=10,
+ magnitude_level=9,
+ magnitude_std=0.5,
+ hparams=dict(
+ pad_val=[round(x) for x in bgr_mean], interpolation='bicubic')),
+ dict(
+ type='RandomErasing',
+ erase_prob=0.25,
+ mode='rand',
+ min_area_ratio=0.02,
+ max_area_ratio=1 / 3,
+ fill_color=bgr_mean,
+ fill_std=bgr_std),
+ dict(type='PackInputs'),
+]
+
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='ResizeEdge',
+ scale=292,
+ edge='short',
+ backend='pillow',
+ interpolation='bicubic'),
+ dict(type='CenterCrop', crop_size=256),
+ dict(type='PackInputs')
+]
+
+train_dataloader = dict(
+ batch_size=64,
+ num_workers=5,
+ dataset=dict(
+ type=dataset_type,
+ data_root='data/imagenet',
+ split='train',
+ pipeline=train_pipeline),
+ sampler=dict(type='DefaultSampler', shuffle=True),
+)
+
+val_dataloader = dict(
+ batch_size=64,
+ num_workers=5,
+ dataset=dict(
+ type=dataset_type,
+ data_root='data/imagenet',
+ split='val',
+ pipeline=test_pipeline),
+ sampler=dict(type='DefaultSampler', shuffle=False),
+)
+val_evaluator = dict(type='Accuracy', topk=(1, 5))
+
+# If you want a standard test, please manually configure the test dataset
+test_dataloader = val_dataloader
+test_evaluator = val_evaluator
diff --git a/configs/_base_/datasets/imagenet_bs64_hivit_224.py b/configs/_base_/datasets/imagenet_bs64_hivit_224.py
new file mode 100644
index 0000000000000000000000000000000000000000..4c258d7ab50ac74c3b2bb30a852f8f38a0f10b83
--- /dev/null
+++ b/configs/_base_/datasets/imagenet_bs64_hivit_224.py
@@ -0,0 +1,83 @@
+# dataset settings
+dataset_type = 'ImageNet'
+data_root = 'data/imagenet/'
+data_preprocessor = dict(
+ num_classes=1000,
+ # RGB format normalization parameters
+ mean=[123.675, 116.28, 103.53],
+ std=[58.395, 57.12, 57.375],
+ # convert image from BGR to RGB
+ to_rgb=True,
+)
+
+bgr_mean = data_preprocessor['mean'][::-1]
+bgr_std = data_preprocessor['std'][::-1]
+
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='RandomResizedCrop',
+ scale=224,
+ backend='pillow',
+ interpolation='bicubic'),
+ dict(type='RandomFlip', prob=0.5, direction='horizontal'),
+ dict(
+ type='RandAugment',
+ policies='timm_increasing',
+ num_policies=2,
+ total_level=10,
+ magnitude_level=9,
+ magnitude_std=0.5,
+ hparams=dict(
+ pad_val=[round(x) for x in bgr_mean], interpolation='bicubic')),
+ dict(
+ type='RandomErasing',
+ erase_prob=0.25,
+ mode='rand',
+ min_area_ratio=0.02,
+ max_area_ratio=1 / 3,
+ fill_color=bgr_mean,
+ fill_std=bgr_std),
+ dict(type='PackInputs'),
+]
+
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='ResizeEdge',
+ scale=256,
+ edge='short',
+ backend='pillow',
+ interpolation='bicubic'),
+ dict(type='CenterCrop', crop_size=224),
+ dict(type='PackInputs'),
+]
+
+train_dataloader = dict(
+ batch_size=64,
+ num_workers=5,
+ dataset=dict(
+ type=dataset_type,
+ data_root=data_root,
+ ann_file='meta/train.txt',
+ data_prefix='train',
+ pipeline=train_pipeline),
+ sampler=dict(type='DefaultSampler', shuffle=True),
+)
+
+val_dataloader = dict(
+ batch_size=64,
+ num_workers=5,
+ dataset=dict(
+ type=dataset_type,
+ data_root=data_root,
+ ann_file='meta/val.txt',
+ data_prefix='val',
+ pipeline=test_pipeline),
+ sampler=dict(type='DefaultSampler', shuffle=False),
+)
+val_evaluator = dict(type='Accuracy', topk=(1, 5))
+
+# If you want a standard test, please manually configure the test dataset
+test_dataloader = val_dataloader
+test_evaluator = val_evaluator
diff --git a/configs/_base_/datasets/imagenet_bs64_mixer_224.py b/configs/_base_/datasets/imagenet_bs64_mixer_224.py
new file mode 100644
index 0000000000000000000000000000000000000000..b92a5141b5d3c0784216c83effb7b171c631fccc
--- /dev/null
+++ b/configs/_base_/datasets/imagenet_bs64_mixer_224.py
@@ -0,0 +1,52 @@
+# dataset settings
+dataset_type = 'ImageNet'
+
+# Google Research usually uses the normalization setting below.
+data_preprocessor = dict(
+ num_classes=1000,
+ mean=[127.5, 127.5, 127.5],
+ std=[127.5, 127.5, 127.5],
+ # convert image from BGR to RGB
+ to_rgb=True,
+)
+
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='RandomResizedCrop', scale=224),
+ dict(type='RandomFlip', prob=0.5, direction='horizontal'),
+ dict(type='PackInputs'),
+]
+
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='ResizeEdge', scale=256, edge='short', interpolation='bicubic'),
+ dict(type='CenterCrop', crop_size=224),
+ dict(type='PackInputs'),
+]
+
+train_dataloader = dict(
+ batch_size=64,
+ num_workers=5,
+ dataset=dict(
+ type=dataset_type,
+ data_root='data/imagenet',
+ split='train',
+ pipeline=train_pipeline),
+ sampler=dict(type='DefaultSampler', shuffle=True),
+)
+
+val_dataloader = dict(
+ batch_size=64,
+ num_workers=5,
+ dataset=dict(
+ type=dataset_type,
+ data_root='data/imagenet',
+ split='val',
+ pipeline=test_pipeline),
+ sampler=dict(type='DefaultSampler', shuffle=False),
+)
+val_evaluator = dict(type='Accuracy', topk=(1, 5))
+
+# If you want a standard test, please manually configure the test dataset
+test_dataloader = val_dataloader
+test_evaluator = val_evaluator
diff --git a/configs/_base_/datasets/imagenet_bs64_pil_resize.py b/configs/_base_/datasets/imagenet_bs64_pil_resize.py
new file mode 100644
index 0000000000000000000000000000000000000000..79f9325b022ac8b9219134a3b1ef47b584fcf3b2
--- /dev/null
+++ b/configs/_base_/datasets/imagenet_bs64_pil_resize.py
@@ -0,0 +1,51 @@
+# dataset settings
+dataset_type = 'ImageNet'
+data_preprocessor = dict(
+ num_classes=1000,
+ # RGB format normalization parameters
+ mean=[123.675, 116.28, 103.53],
+ std=[58.395, 57.12, 57.375],
+ # convert image from BGR to RGB
+ to_rgb=True,
+)
+
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='RandomResizedCrop', scale=224, backend='pillow'),
+ dict(type='RandomFlip', prob=0.5, direction='horizontal'),
+ dict(type='PackInputs'),
+]
+
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='ResizeEdge', scale=256, edge='short', backend='pillow'),
+ dict(type='CenterCrop', crop_size=224),
+ dict(type='PackInputs'),
+]
+
+train_dataloader = dict(
+ batch_size=64,
+ num_workers=5,
+ dataset=dict(
+ type=dataset_type,
+ data_root='data/imagenet',
+ split='train',
+ pipeline=train_pipeline),
+ sampler=dict(type='DefaultSampler', shuffle=True),
+)
+
+val_dataloader = dict(
+ batch_size=64,
+ num_workers=5,
+ dataset=dict(
+ type=dataset_type,
+ data_root='data/imagenet',
+ split='val',
+ pipeline=test_pipeline),
+ sampler=dict(type='DefaultSampler', shuffle=False),
+)
+val_evaluator = dict(type='Accuracy', topk=(1, 5))
+
+# If you want a standard test, please manually configure the test dataset
+test_dataloader = val_dataloader
+test_evaluator = val_evaluator
diff --git a/configs/_base_/datasets/imagenet_bs64_pil_resize_autoaug.py b/configs/_base_/datasets/imagenet_bs64_pil_resize_autoaug.py
new file mode 100644
index 0000000000000000000000000000000000000000..c25906716c651d63440e1adeed66303ad7dae233
--- /dev/null
+++ b/configs/_base_/datasets/imagenet_bs64_pil_resize_autoaug.py
@@ -0,0 +1,68 @@
+# dataset settings
+dataset_type = 'ImageNet'
+data_preprocessor = dict(
+ num_classes=1000,
+ # RGB format normalization parameters
+ mean=[123.675, 116.28, 103.53],
+ std=[58.395, 57.12, 57.375],
+ # convert image from BGR to RGB
+ to_rgb=True,
+)
+
+bgr_mean = data_preprocessor['mean'][::-1]
+bgr_std = data_preprocessor['std'][::-1]
+
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='RandomResizedCrop',
+ scale=224,
+ backend='pillow',
+ interpolation='bicubic'),
+ dict(type='RandomFlip', prob=0.5, direction='horizontal'),
+ dict(
+ type='AutoAugment',
+ policies='imagenet',
+ hparams=dict(
+ pad_val=[round(x) for x in bgr_mean], interpolation='bicubic')),
+ dict(type='PackInputs'),
+]
+
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='ResizeEdge',
+ scale=256,
+ edge='short',
+ backend='pillow',
+ interpolation='bicubic'),
+ dict(type='CenterCrop', crop_size=224),
+ dict(type='PackInputs'),
+]
+
+train_dataloader = dict(
+ batch_size=64,
+ num_workers=5,
+ dataset=dict(
+ type=dataset_type,
+ data_root='data/imagenet',
+ split='train',
+ pipeline=train_pipeline),
+ sampler=dict(type='DefaultSampler', shuffle=True),
+)
+
+val_dataloader = dict(
+ batch_size=64,
+ num_workers=5,
+ dataset=dict(
+ type=dataset_type,
+ data_root='data/imagenet',
+ split='val',
+ pipeline=test_pipeline),
+ sampler=dict(type='DefaultSampler', shuffle=False),
+)
+val_evaluator = dict(type='Accuracy', topk=(1, 5))
+
+# If you want a standard test, please manually configure the test dataset
+test_dataloader = val_dataloader
+test_evaluator = val_evaluator
diff --git a/configs/_base_/datasets/imagenet_bs64_swin_224.py b/configs/_base_/datasets/imagenet_bs64_swin_224.py
new file mode 100644
index 0000000000000000000000000000000000000000..6e8786eb0feb5cade66d01b6ce99b4240e11918b
--- /dev/null
+++ b/configs/_base_/datasets/imagenet_bs64_swin_224.py
@@ -0,0 +1,80 @@
+# dataset settings
+dataset_type = 'ImageNet'
+data_preprocessor = dict(
+ num_classes=1000,
+ # RGB format normalization parameters
+ mean=[123.675, 116.28, 103.53],
+ std=[58.395, 57.12, 57.375],
+ # convert image from BGR to RGB
+ to_rgb=True,
+)
+
+bgr_mean = data_preprocessor['mean'][::-1]
+bgr_std = data_preprocessor['std'][::-1]
+
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='RandomResizedCrop',
+ scale=224,
+ backend='pillow',
+ interpolation='bicubic'),
+ dict(type='RandomFlip', prob=0.5, direction='horizontal'),
+ dict(
+ type='RandAugment',
+ policies='timm_increasing',
+ num_policies=2,
+ total_level=10,
+ magnitude_level=9,
+ magnitude_std=0.5,
+ hparams=dict(
+ pad_val=[round(x) for x in bgr_mean], interpolation='bicubic')),
+ dict(
+ type='RandomErasing',
+ erase_prob=0.25,
+ mode='rand',
+ min_area_ratio=0.02,
+ max_area_ratio=1 / 3,
+ fill_color=bgr_mean,
+ fill_std=bgr_std),
+ dict(type='PackInputs'),
+]
+
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='ResizeEdge',
+ scale=256,
+ edge='short',
+ backend='pillow',
+ interpolation='bicubic'),
+ dict(type='CenterCrop', crop_size=224),
+ dict(type='PackInputs'),
+]
+
+train_dataloader = dict(
+ batch_size=64,
+ num_workers=5,
+ dataset=dict(
+ type=dataset_type,
+ data_root='data/imagenet',
+ split='train',
+ pipeline=train_pipeline),
+ sampler=dict(type='DefaultSampler', shuffle=True),
+)
+
+val_dataloader = dict(
+ batch_size=64,
+ num_workers=5,
+ dataset=dict(
+ type=dataset_type,
+ data_root='data/imagenet',
+ split='val',
+ pipeline=test_pipeline),
+ sampler=dict(type='DefaultSampler', shuffle=False),
+)
+val_evaluator = dict(type='Accuracy', topk=(1, 5))
+
+# If you want a standard test, please manually configure the test dataset
+test_dataloader = val_dataloader
+test_evaluator = val_evaluator
diff --git a/configs/_base_/datasets/imagenet_bs64_swin_256.py b/configs/_base_/datasets/imagenet_bs64_swin_256.py
new file mode 100644
index 0000000000000000000000000000000000000000..9ecb41ba4d69c25ddc70469de440a0fde681fbc7
--- /dev/null
+++ b/configs/_base_/datasets/imagenet_bs64_swin_256.py
@@ -0,0 +1,80 @@
+# dataset settings
+dataset_type = 'ImageNet'
+data_preprocessor = dict(
+ num_classes=1000,
+ # RGB format normalization parameters
+ mean=[123.675, 116.28, 103.53],
+ std=[58.395, 57.12, 57.375],
+ # convert image from BGR to RGB
+ to_rgb=True,
+)
+
+bgr_mean = data_preprocessor['mean'][::-1]
+bgr_std = data_preprocessor['std'][::-1]
+
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='RandomResizedCrop',
+ scale=256,
+ backend='pillow',
+ interpolation='bicubic'),
+ dict(type='RandomFlip', prob=0.5, direction='horizontal'),
+ dict(
+ type='RandAugment',
+ policies='timm_increasing',
+ num_policies=2,
+ total_level=10,
+ magnitude_level=9,
+ magnitude_std=0.5,
+ hparams=dict(
+ pad_val=[round(x) for x in bgr_mean], interpolation='bicubic')),
+ dict(
+ type='RandomErasing',
+ erase_prob=0.25,
+ mode='rand',
+ min_area_ratio=0.02,
+ max_area_ratio=1 / 3,
+ fill_color=bgr_mean,
+ fill_std=bgr_std),
+ dict(type='PackInputs'),
+]
+
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='ResizeEdge',
+ scale=292, # ( 256 / 224 * 256 )
+ edge='short',
+ backend='pillow',
+ interpolation='bicubic'),
+ dict(type='CenterCrop', crop_size=256),
+ dict(type='PackInputs'),
+]
+
+train_dataloader = dict(
+ batch_size=64,
+ num_workers=5,
+ dataset=dict(
+ type=dataset_type,
+ data_root='data/imagenet',
+ split='train',
+ pipeline=train_pipeline),
+ sampler=dict(type='DefaultSampler', shuffle=True),
+)
+
+val_dataloader = dict(
+ batch_size=64,
+ num_workers=5,
+ dataset=dict(
+ type=dataset_type,
+ data_root='data/imagenet',
+ split='val',
+ pipeline=test_pipeline),
+ sampler=dict(type='DefaultSampler', shuffle=False),
+)
+val_evaluator = dict(type='Accuracy', topk=(1, 5))
+
+# If you want a standard test, please manually configure the test dataset
+test_dataloader = val_dataloader
+test_evaluator = val_evaluator
diff --git a/configs/_base_/datasets/imagenet_bs64_swin_384.py b/configs/_base_/datasets/imagenet_bs64_swin_384.py
new file mode 100644
index 0000000000000000000000000000000000000000..11264f808c1d154c80f5609fbe25e1e7e69a5c88
--- /dev/null
+++ b/configs/_base_/datasets/imagenet_bs64_swin_384.py
@@ -0,0 +1,54 @@
+# dataset settings
+dataset_type = 'ImageNet'
+data_preprocessor = dict(
+ num_classes=1000,
+ # RGB format normalization parameters
+ mean=[123.675, 116.28, 103.53],
+ std=[58.395, 57.12, 57.375],
+ # convert image from BGR to RGB
+ to_rgb=True,
+)
+
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='RandomResizedCrop',
+ scale=384,
+ backend='pillow',
+ interpolation='bicubic'),
+ dict(type='RandomFlip', prob=0.5, direction='horizontal'),
+ dict(type='PackInputs'),
+]
+
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='Resize', scale=384, backend='pillow', interpolation='bicubic'),
+ dict(type='PackInputs'),
+]
+
+train_dataloader = dict(
+ batch_size=64,
+ num_workers=5,
+ dataset=dict(
+ type=dataset_type,
+ data_root='data/imagenet',
+ split='train',
+ pipeline=train_pipeline),
+ sampler=dict(type='DefaultSampler', shuffle=True),
+)
+
+val_dataloader = dict(
+ batch_size=64,
+ num_workers=5,
+ dataset=dict(
+ type=dataset_type,
+ data_root='data/imagenet',
+ split='val',
+ pipeline=test_pipeline),
+ sampler=dict(type='DefaultSampler', shuffle=False),
+)
+val_evaluator = dict(type='Accuracy', topk=(1, 5))
+
+# If you want a standard test, please manually configure the test dataset
+test_dataloader = val_dataloader
+test_evaluator = val_evaluator
diff --git a/configs/_base_/datasets/imagenet_bs64_t2t_224.py b/configs/_base_/datasets/imagenet_bs64_t2t_224.py
new file mode 100644
index 0000000000000000000000000000000000000000..8a2dc10f85647fd20afd26d07a2c87a3e3a36962
--- /dev/null
+++ b/configs/_base_/datasets/imagenet_bs64_t2t_224.py
@@ -0,0 +1,80 @@
+# dataset settings
+dataset_type = 'ImageNet'
+data_preprocessor = dict(
+ num_classes=1000,
+ # RGB format normalization parameters
+ mean=[123.675, 116.28, 103.53],
+ std=[58.395, 57.12, 57.375],
+ # convert image from BGR to RGB
+ to_rgb=True,
+)
+
+bgr_mean = data_preprocessor['mean'][::-1]
+bgr_std = data_preprocessor['std'][::-1]
+
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='RandomResizedCrop',
+ scale=224,
+ backend='pillow',
+ interpolation='bicubic'),
+ dict(type='RandomFlip', prob=0.5, direction='horizontal'),
+ dict(
+ type='RandAugment',
+ policies='timm_increasing',
+ num_policies=2,
+ total_level=10,
+ magnitude_level=9,
+ magnitude_std=0.5,
+ hparams=dict(
+ pad_val=[round(x) for x in bgr_mean], interpolation='bicubic')),
+ dict(
+ type='RandomErasing',
+ erase_prob=0.25,
+ mode='rand',
+ min_area_ratio=0.02,
+ max_area_ratio=1 / 3,
+ fill_color=bgr_mean,
+ fill_std=bgr_std),
+ dict(type='PackInputs'),
+]
+
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='ResizeEdge',
+ scale=248,
+ edge='short',
+ backend='pillow',
+ interpolation='bicubic'),
+ dict(type='CenterCrop', crop_size=224),
+ dict(type='PackInputs'),
+]
+
+train_dataloader = dict(
+ batch_size=64,
+ num_workers=5,
+ dataset=dict(
+ type=dataset_type,
+ data_root='data/imagenet',
+ split='train',
+ pipeline=train_pipeline),
+ sampler=dict(type='DefaultSampler', shuffle=True),
+)
+
+val_dataloader = dict(
+ batch_size=64,
+ num_workers=5,
+ dataset=dict(
+ type=dataset_type,
+ data_root='data/imagenet',
+ split='val',
+ pipeline=test_pipeline),
+ sampler=dict(type='DefaultSampler', shuffle=False),
+)
+val_evaluator = dict(type='Accuracy', topk=(1, 5))
+
+# If you want a standard test, please manually configure the test dataset
+test_dataloader = val_dataloader
+test_evaluator = val_evaluator
diff --git a/configs/_base_/datasets/imagenet_bs8_pil_bicubic_320.py b/configs/_base_/datasets/imagenet_bs8_pil_bicubic_320.py
new file mode 100644
index 0000000000000000000000000000000000000000..7160084e56b44205d92a8266fc78ff51bf2a7b4c
--- /dev/null
+++ b/configs/_base_/datasets/imagenet_bs8_pil_bicubic_320.py
@@ -0,0 +1,59 @@
+# dataset settings
+dataset_type = 'ImageNet'
+data_preprocessor = dict(
+ # RGB format normalization parameters
+ mean=[122.5, 122.5, 122.5],
+ std=[122.5, 122.5, 122.5],
+ # convert image from BGR to RGB
+ to_rgb=True,
+)
+
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='RandomResizedCrop',
+ scale=320,
+ backend='pillow',
+ interpolation='bicubic'),
+ dict(type='RandomFlip', prob=0.5, direction='horizontal'),
+ dict(type='PackInputs'),
+]
+
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='ResizeEdge',
+ scale=int(320 / 224 * 256),
+ edge='short',
+ backend='pillow',
+ interpolation='bicubic'),
+ dict(type='CenterCrop', crop_size=320),
+ dict(type='PackInputs'),
+]
+
+train_dataloader = dict(
+ batch_size=8,
+ num_workers=5,
+ dataset=dict(
+ type=dataset_type,
+ data_root='data/imagenet',
+ split='train',
+ pipeline=train_pipeline),
+ sampler=dict(type='DefaultSampler', shuffle=True),
+)
+
+val_dataloader = dict(
+ batch_size=8,
+ num_workers=5,
+ dataset=dict(
+ type=dataset_type,
+ data_root='data/imagenet',
+ split='val',
+ pipeline=test_pipeline),
+ sampler=dict(type='DefaultSampler', shuffle=False),
+)
+val_evaluator = dict(type='Accuracy', topk=(1, 5))
+
+# If you want a standard test, please manually configure the test dataset
+test_dataloader = val_dataloader
+test_evaluator = val_evaluator
diff --git a/configs/_base_/datasets/inshop_bs32_448.py b/configs/_base_/datasets/inshop_bs32_448.py
new file mode 100644
index 0000000000000000000000000000000000000000..f9772fa665d4a5a3abae575a8fc61fb9f360cd0e
--- /dev/null
+++ b/configs/_base_/datasets/inshop_bs32_448.py
@@ -0,0 +1,64 @@
+# dataset settings
+dataset_type = 'InShop'
+data_preprocessor = dict(
+ num_classes=3997,
+ mean=[123.675, 116.28, 103.53],
+ std=[58.395, 57.12, 57.375],
+ # convert image from BGR to RGB
+ to_rgb=True)
+
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='Resize', scale=512),
+ dict(type='RandomCrop', crop_size=448),
+ dict(type='RandomFlip', prob=0.5, direction='horizontal'),
+ dict(type='PackInputs'),
+]
+
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='Resize', scale=512),
+ dict(type='CenterCrop', crop_size=448),
+ dict(type='PackInputs'),
+]
+
+train_dataloader = dict(
+ batch_size=32,
+ num_workers=4,
+ dataset=dict(
+ type=dataset_type,
+ data_root='data/inshop',
+ split='train',
+ pipeline=train_pipeline),
+ sampler=dict(type='DefaultSampler', shuffle=True),
+)
+
+query_dataloader = dict(
+ batch_size=32,
+ num_workers=4,
+ dataset=dict(
+ type=dataset_type,
+ data_root='data/inshop',
+ split='query',
+ pipeline=test_pipeline),
+ sampler=dict(type='DefaultSampler', shuffle=False),
+)
+
+gallery_dataloader = dict(
+ batch_size=32,
+ num_workers=4,
+ dataset=dict(
+ type=dataset_type,
+ data_root='data/inshop',
+ split='gallery',
+ pipeline=test_pipeline),
+ sampler=dict(type='DefaultSampler', shuffle=False),
+)
+val_dataloader = query_dataloader
+val_evaluator = [
+ dict(type='RetrievalRecall', topk=1),
+ dict(type='RetrievalAveragePrecision', topk=10),
+]
+
+test_dataloader = val_dataloader
+test_evaluator = val_evaluator
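Unlike the classification configs, this retrieval config exposes separate `query_dataloader` and `gallery_dataloader`; how they are consumed is decided by the model config that inherits this file. The sketch below shows one plausible wiring based on MMPreTrain's image-retrieval support; the class and loop names (`ImageToImageRetriever`, `RetrievalValLoop`, `RetrievalTestLoop`) are assumptions to verify against the repository, not something introduced by this diff.

```python
# Hypothetical downstream retrieval config (not part of this diff).
_base_ = ['../_base_/datasets/inshop_bs32_448.py']  # the file added above

# Assumption: the gallery loader serves as the retrieval prototype, while
# validation/testing use retrieval loops that embed the gallery before
# scoring the query set.
model = dict(
    type='ImageToImageRetriever',             # assumed retriever class
    prototype={{_base_.gallery_dataloader}},  # MMEngine base-variable syntax
)
val_cfg = dict(type='RetrievalValLoop')       # assumed loop names
test_cfg = dict(type='RetrievalTestLoop')
```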
diff --git a/configs/_base_/datasets/nlvr2.py b/configs/_base_/datasets/nlvr2.py
new file mode 100644
index 0000000000000000000000000000000000000000..2f5314bcd14d9e4f79898411e9c687470e31ac02
--- /dev/null
+++ b/configs/_base_/datasets/nlvr2.py
@@ -0,0 +1,86 @@
+# dataset settings
+data_preprocessor = dict(
+ type='MultiModalDataPreprocessor',
+ mean=[122.770938, 116.7460125, 104.09373615],
+ std=[68.5005327, 66.6321579, 70.32316305],
+ to_rgb=True,
+)
+
+train_pipeline = [
+ dict(
+ type='ApplyToList',
+ # NLVR2 requires loading two images per sample.
+ scatter_key='img_path',
+ transforms=[
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='RandomResizedCrop',
+ scale=384,
+ interpolation='bicubic',
+ backend='pillow'),
+ dict(type='RandomFlip', prob=0.5, direction='horizontal'),
+ ],
+ collate_keys=['img', 'scale_factor', 'ori_shape'],
+ ),
+ dict(type='CleanCaption', keys='text'),
+ dict(
+ type='PackInputs',
+ algorithm_keys=['text'],
+ meta_keys=['image_id'],
+ ),
+]
+
+test_pipeline = [
+ dict(
+ type='ApplyToList',
+ # NLVR2 requires loading two images per sample.
+ scatter_key='img_path',
+ transforms=[
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='Resize',
+ scale=(384, 384),
+ interpolation='bicubic',
+ backend='pillow'),
+ ],
+ collate_keys=['img', 'scale_factor', 'ori_shape'],
+ ),
+ dict(
+ type='PackInputs',
+ algorithm_keys=['text'],
+ meta_keys=['image_id'],
+ ),
+]
+
+train_dataloader = dict(
+ batch_size=16,
+ num_workers=8,
+ dataset=dict(
+ type='NLVR2',
+ data_root='data/nlvr2',
+ ann_file='dev.json',
+ data_prefix='dev',
+ pipeline=train_pipeline),
+ sampler=dict(type='DefaultSampler', shuffle=True),
+ persistent_workers=True,
+ drop_last=True,
+)
+
+val_dataloader = dict(
+ batch_size=64,
+ num_workers=8,
+ dataset=dict(
+ type='NLVR2',
+ data_root='data/nlvr2',
+ ann_file='dev.json',
+ data_prefix='dev',
+ pipeline=test_pipeline,
+ ),
+ sampler=dict(type='DefaultSampler', shuffle=False),
+ persistent_workers=True,
+)
+val_evaluator = dict(type='Accuracy')
+
+# If you want a standard test, please manually configure the test dataset
+test_dataloader = val_dataloader
+test_evaluator = val_evaluator
diff --git a/configs/_base_/datasets/nocaps.py b/configs/_base_/datasets/nocaps.py
new file mode 100644
index 0000000000000000000000000000000000000000..5176671f2b9335b12127c7b58b2626eec12476ea
--- /dev/null
+++ b/configs/_base_/datasets/nocaps.py
@@ -0,0 +1,41 @@
+# data settings
+
+data_preprocessor = dict(
+ type='MultiModalDataPreprocessor',
+ mean=[122.770938, 116.7460125, 104.09373615],
+ std=[68.5005327, 66.6321579, 70.32316305],
+ to_rgb=True,
+)
+
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='Resize',
+ scale=(384, 384),
+ interpolation='bicubic',
+ backend='pillow'),
+ dict(type='PackInputs', meta_keys=['image_id']),
+]
+
+val_dataloader = dict(
+ batch_size=16,
+ num_workers=5,
+ dataset=dict(
+ type='NoCaps',
+ data_root='data/nocaps/',
+ data_prefix=dict(img_path='images/'),
+ ann_file='annotations/nocaps_val_4500_captions.json',
+ pipeline=test_pipeline,
+ ),
+ sampler=dict(type='DefaultSampler', shuffle=False),
+ persistent_workers=True,
+)
+
+val_evaluator = dict(
+ type='NocapsSave',
+ save_dir='./',
+)
+
+# If you want a standard test, please manually configure the test dataset
+test_dataloader = val_dataloader
+test_evaluator = val_evaluator
diff --git a/configs/_base_/datasets/ocrvqa.py b/configs/_base_/datasets/ocrvqa.py
new file mode 100644
index 0000000000000000000000000000000000000000..09e6e3536141f8ea901d2e5bb3070c23d816e8bc
--- /dev/null
+++ b/configs/_base_/datasets/ocrvqa.py
@@ -0,0 +1,81 @@
+# data settings
+
+data_preprocessor = dict(
+ mean=[122.770938, 116.7460125, 104.09373615],
+ std=[68.5005327, 66.6321579, 70.32316305],
+ to_rgb=True,
+)
+
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='RandomResizedCrop',
+ scale=384,
+ interpolation='bicubic',
+ backend='pillow'),
+ dict(type='CleanCaption', keys=['question', 'gt_answer']),
+ dict(
+ type='PackInputs',
+ algorithm_keys=['question', 'gt_answer', 'gt_answer_weight'],
+ meta_keys=[],
+ ),
+]
+
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='Resize',
+ scale=(480, 480),
+ interpolation='bicubic',
+ backend='pillow'),
+ dict(type='CleanCaption', keys=['question', 'gt_answer']),
+ dict(
+ type='PackInputs',
+ algorithm_keys=['question', 'gt_answer', 'gt_answer_weight'],
+ meta_keys=[],
+ ),
+]
+
+train_dataloader = dict(
+ batch_size=16,
+ num_workers=8,
+ dataset=dict(
+ type='OCRVQA',
+ data_root='data/ocrvqa',
+ data_prefix='images',
+ ann_file='annotations/dataset.json',
+ split='train',
+ pipeline=train_pipeline),
+ sampler=dict(type='DefaultSampler', shuffle=True),
+ persistent_workers=True,
+ drop_last=True,
+)
+
+val_dataloader = dict(
+ batch_size=64,
+ num_workers=8,
+ dataset=dict(
+ type='OCRVQA',
+ data_root='data/ocrvqa',
+ data_prefix='images',
+ ann_file='annotations/dataset.json',
+ split='val',
+ pipeline=test_pipeline),
+ sampler=dict(type='DefaultSampler', shuffle=False),
+ persistent_workers=True,
+)
+val_evaluator = dict(type='VQAAcc')
+
+test_dataloader = dict(
+ batch_size=64,
+ num_workers=8,
+ dataset=dict(
+ type='OCRVQA',
+ data_root='data/ocrvqa',
+ data_prefix='images',
+ ann_file='annotations/dataset.json',
+ split='test',
+ pipeline=test_pipeline),
+ sampler=dict(type='DefaultSampler', shuffle=False),
+)
+test_evaluator = dict(type='VQAAcc')
diff --git a/configs/_base_/datasets/pipelines/auto_aug.py b/configs/_base_/datasets/pipelines/auto_aug.py
new file mode 100644
index 0000000000000000000000000000000000000000..5a10f7eec61ea40336698118342939470f73d052
--- /dev/null
+++ b/configs/_base_/datasets/pipelines/auto_aug.py
@@ -0,0 +1,96 @@
+# Policy for ImageNet, referring to
+# https://github.com/DeepVoltaire/AutoAugment/blame/master/autoaugment.py
+policy_imagenet = [
+ [
+ dict(type='Posterize', bits=4, prob=0.4),
+ dict(type='Rotate', angle=30., prob=0.6)
+ ],
+ [
+ dict(type='Solarize', thr=256 / 9 * 4, prob=0.6),
+ dict(type='AutoContrast', prob=0.6)
+ ],
+ [dict(type='Equalize', prob=0.8),
+ dict(type='Equalize', prob=0.6)],
+ [
+ dict(type='Posterize', bits=5, prob=0.6),
+ dict(type='Posterize', bits=5, prob=0.6)
+ ],
+ [
+ dict(type='Equalize', prob=0.4),
+ dict(type='Solarize', thr=256 / 9 * 5, prob=0.2)
+ ],
+ [
+ dict(type='Equalize', prob=0.4),
+ dict(type='Rotate', angle=30 / 9 * 8, prob=0.8)
+ ],
+ [
+ dict(type='Solarize', thr=256 / 9 * 6, prob=0.6),
+ dict(type='Equalize', prob=0.6)
+ ],
+ [dict(type='Posterize', bits=6, prob=0.8),
+ dict(type='Equalize', prob=1.)],
+ [
+ dict(type='Rotate', angle=10., prob=0.2),
+ dict(type='Solarize', thr=256 / 9, prob=0.6)
+ ],
+ [
+ dict(type='Equalize', prob=0.6),
+ dict(type='Posterize', bits=5, prob=0.4)
+ ],
+ [
+ dict(type='Rotate', angle=30 / 9 * 8, prob=0.8),
+ dict(type='ColorTransform', magnitude=0., prob=0.4)
+ ],
+ [
+ dict(type='Rotate', angle=30., prob=0.4),
+ dict(type='Equalize', prob=0.6)
+ ],
+ [dict(type='Equalize', prob=0.0),
+ dict(type='Equalize', prob=0.8)],
+ [dict(type='Invert', prob=0.6),
+ dict(type='Equalize', prob=1.)],
+ [
+ dict(type='ColorTransform', magnitude=0.4, prob=0.6),
+ dict(type='Contrast', magnitude=0.8, prob=1.)
+ ],
+ [
+ dict(type='Rotate', angle=30 / 9 * 8, prob=0.8),
+ dict(type='ColorTransform', magnitude=0.2, prob=1.)
+ ],
+ [
+ dict(type='ColorTransform', magnitude=0.8, prob=0.8),
+ dict(type='Solarize', thr=256 / 9 * 2, prob=0.8)
+ ],
+ [
+ dict(type='Sharpness', magnitude=0.7, prob=0.4),
+ dict(type='Invert', prob=0.6)
+ ],
+ [
+ dict(
+ type='Shear',
+ magnitude=0.3 / 9 * 5,
+ prob=0.6,
+ direction='horizontal'),
+ dict(type='Equalize', prob=1.)
+ ],
+ [
+ dict(type='ColorTransform', magnitude=0., prob=0.4),
+ dict(type='Equalize', prob=0.6)
+ ],
+ [
+ dict(type='Equalize', prob=0.4),
+ dict(type='Solarize', thr=256 / 9 * 5, prob=0.2)
+ ],
+ [
+ dict(type='Solarize', thr=256 / 9 * 4, prob=0.6),
+ dict(type='AutoContrast', prob=0.6)
+ ],
+ [dict(type='Invert', prob=0.6),
+ dict(type='Equalize', prob=1.)],
+ [
+ dict(type='ColorTransform', magnitude=0.4, prob=0.6),
+ dict(type='Contrast', magnitude=0.8, prob=1.)
+ ],
+ [dict(type='Equalize', prob=0.8),
+ dict(type='Equalize', prob=0.6)],
+]
diff --git a/configs/_base_/datasets/pipelines/rand_aug.py b/configs/_base_/datasets/pipelines/rand_aug.py
new file mode 100644
index 0000000000000000000000000000000000000000..f2bab3c364f0d0223f2c972673da3abb6ac21bc6
--- /dev/null
+++ b/configs/_base_/datasets/pipelines/rand_aug.py
@@ -0,0 +1,43 @@
+# Refers to `_RAND_INCREASING_TRANSFORMS` in pytorch-image-models
+rand_increasing_policies = [
+ dict(type='AutoContrast'),
+ dict(type='Equalize'),
+ dict(type='Invert'),
+ dict(type='Rotate', magnitude_key='angle', magnitude_range=(0, 30)),
+ dict(type='Posterize', magnitude_key='bits', magnitude_range=(4, 0)),
+ dict(type='Solarize', magnitude_key='thr', magnitude_range=(256, 0)),
+ dict(
+ type='SolarizeAdd',
+ magnitude_key='magnitude',
+ magnitude_range=(0, 110)),
+ dict(
+ type='ColorTransform',
+ magnitude_key='magnitude',
+ magnitude_range=(0, 0.9)),
+ dict(type='Contrast', magnitude_key='magnitude', magnitude_range=(0, 0.9)),
+ dict(
+ type='Brightness', magnitude_key='magnitude',
+ magnitude_range=(0, 0.9)),
+ dict(
+ type='Sharpness', magnitude_key='magnitude', magnitude_range=(0, 0.9)),
+ dict(
+ type='Shear',
+ magnitude_key='magnitude',
+ magnitude_range=(0, 0.3),
+ direction='horizontal'),
+ dict(
+ type='Shear',
+ magnitude_key='magnitude',
+ magnitude_range=(0, 0.3),
+ direction='vertical'),
+ dict(
+ type='Translate',
+ magnitude_key='magnitude',
+ magnitude_range=(0, 0.45),
+ direction='horizontal'),
+ dict(
+ type='Translate',
+ magnitude_key='magnitude',
+ magnitude_range=(0, 0.45),
+ direction='vertical')
+]
diff --git a/configs/_base_/datasets/refcoco.py b/configs/_base_/datasets/refcoco.py
new file mode 100644
index 0000000000000000000000000000000000000000..f698e76c032fb22cc739450cc1e81e3174fd2b2f
--- /dev/null
+++ b/configs/_base_/datasets/refcoco.py
@@ -0,0 +1,105 @@
+# data settings
+
+data_preprocessor = dict(
+ mean=[122.770938, 116.7460125, 104.09373615],
+ std=[68.5005327, 66.6321579, 70.32316305],
+ to_rgb=True,
+)
+
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='RandomApply',
+ transforms=[
+ dict(
+ type='ColorJitter',
+ brightness=0.4,
+ contrast=0.4,
+ saturation=0.4,
+ hue=0.1,
+ backend='cv2')
+ ],
+ prob=0.5),
+ dict(
+ type='mmdet.RandomCrop',
+ crop_type='relative_range',
+ crop_size=(0.8, 0.8),
+ allow_negative_crop=False),
+ dict(
+ type='RandomChoiceResize',
+ scales=[(384, 384), (360, 360), (344, 344), (312, 312), (300, 300),
+ (286, 286), (270, 270)],
+ keep_ratio=False),
+ dict(
+ type='RandomTranslatePad',
+ size=384,
+ aug_translate=True,
+ ),
+ dict(type='CleanCaption', keys='text'),
+ dict(
+ type='PackInputs',
+ algorithm_keys=['text', 'gt_bboxes', 'scale_factor'],
+ meta_keys=['image_id'],
+ ),
+]
+
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='Resize',
+ scale=(384, 384),
+ interpolation='bicubic',
+ backend='pillow'),
+ dict(type='CleanCaption', keys='text'),
+ dict(
+ type='PackInputs',
+ algorithm_keys=['text', 'gt_bboxes', 'scale_factor'],
+ meta_keys=['image_id'],
+ ),
+]
+
+train_dataloader = dict(
+ batch_size=16,
+ num_workers=8,
+ dataset=dict(
+ type='RefCOCO',
+ data_root='data/coco',
+ data_prefix='train2014',
+ ann_file='refcoco/instances.json',
+ split_file='refcoco/refs(unc).p',
+ split='train',
+ pipeline=train_pipeline),
+ sampler=dict(type='DefaultSampler', shuffle=True),
+ drop_last=True,
+)
+
+val_dataloader = dict(
+ batch_size=16,
+ num_workers=8,
+ dataset=dict(
+ type='RefCOCO',
+ data_root='data/coco',
+ data_prefix='train2014',
+ ann_file='refcoco/instances.json',
+ split_file='refcoco/refs(unc).p',
+ split='val',
+ pipeline=test_pipeline),
+ sampler=dict(type='DefaultSampler', shuffle=False),
+)
+
+val_evaluator = dict(type='VisualGroundingMetric')
+
+test_dataloader = dict(
+ batch_size=16,
+ num_workers=8,
+ dataset=dict(
+ type='RefCOCO',
+ data_root='data/coco',
+ data_prefix='train2014',
+ ann_file='refcoco/instances.json',
+ split_file='refcoco/refs(unc).p',
+ split='testA', # or 'testB'
+ pipeline=test_pipeline),
+ sampler=dict(type='DefaultSampler', shuffle=False),
+)
+test_evaluator = val_evaluator
diff --git a/configs/_base_/datasets/tiny_imagenet_bs32.py b/configs/_base_/datasets/tiny_imagenet_bs32.py
new file mode 100644
index 0000000000000000000000000000000000000000..6701413de0f7a4b65044dbf513a4267b9092500e
--- /dev/null
+++ b/configs/_base_/datasets/tiny_imagenet_bs32.py
@@ -0,0 +1,51 @@
+# dataset settings
+dataset_type = 'CustomDataset'
+data_preprocessor = dict(
+ num_classes=200,
+ # RGB format normalization parameters
+ mean=[123.675, 116.28, 103.53],
+ std=[58.395, 57.12, 57.375],
+ # convert image from BGR to RGB
+ to_rgb=True,
+)
+
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='RandomResizedCrop', scale=224),
+ dict(type='RandomFlip', prob=0.5, direction='horizontal'),
+ dict(type='PackInputs'),
+]
+
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='ResizeEdge', scale=256, edge='short'),
+ dict(type='CenterCrop', crop_size=224),
+ dict(type='PackInputs'),
+]
+
+train_dataloader = dict(
+ batch_size=32,
+ num_workers=5,
+ dataset=dict(
+ type=dataset_type,
+ data_root='data/imagenet',
+ data_prefix='train',
+ pipeline=train_pipeline),
+ sampler=dict(type='DefaultSampler', shuffle=True),
+)
+
+val_dataloader = dict(
+ batch_size=32,
+ num_workers=5,
+ dataset=dict(
+ type=dataset_type,
+ data_root='data/imagenet',
+ data_prefix='val',
+ pipeline=test_pipeline),
+ sampler=dict(type='DefaultSampler', shuffle=False),
+)
+val_evaluator = dict(type='Accuracy', topk=(1, 5))
+
+# If you want a standard test, please manually configure the test dataset
+test_dataloader = val_dataloader
+test_evaluator = val_evaluator
diff --git a/configs/_base_/datasets/tiny_imagenet_bs32_pil_resize.py b/configs/_base_/datasets/tiny_imagenet_bs32_pil_resize.py
new file mode 100644
index 0000000000000000000000000000000000000000..66250a49aaa549c00623c8549c4eafcae71a9254
--- /dev/null
+++ b/configs/_base_/datasets/tiny_imagenet_bs32_pil_resize.py
@@ -0,0 +1,51 @@
+# dataset settings
+dataset_type = 'CustomDataset'
+data_preprocessor = dict(
+ num_classes=200,
+ # RGB format normalization parameters
+ mean=[123.675, 116.28, 103.53],
+ std=[58.395, 57.12, 57.375],
+ # convert image from BGR to RGB
+ to_rgb=True,
+)
+
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='RandomResizedCrop', scale=224, backend='pillow'),
+ dict(type='RandomFlip', prob=0.5, direction='horizontal'),
+ dict(type='PackInputs'),
+]
+
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='ResizeEdge', scale=256, edge='short', backend='pillow'),
+ dict(type='CenterCrop', crop_size=224),
+ dict(type='PackInputs'),
+]
+
+train_dataloader = dict(
+ batch_size=32,
+ num_workers=5,
+ dataset=dict(
+ type=dataset_type,
+ data_root='data/imagenet',
+ data_prefix='train',
+ pipeline=train_pipeline),
+ sampler=dict(type='DefaultSampler', shuffle=True),
+)
+
+val_dataloader = dict(
+ batch_size=32,
+ num_workers=5,
+ dataset=dict(
+ type=dataset_type,
+ data_root='data/imagenet',
+ data_prefix='val',
+ pipeline=test_pipeline),
+ sampler=dict(type='DefaultSampler', shuffle=False),
+)
+val_evaluator = dict(type='Accuracy', topk=(1, 5))
+
+# If you want a standard test, please manually configure the test dataset
+test_dataloader = val_dataloader
+test_evaluator = val_evaluator
diff --git a/configs/_base_/datasets/tiny_imagenet_bs64_pil_resize_autoaug.py b/configs/_base_/datasets/tiny_imagenet_bs64_pil_resize_autoaug.py
new file mode 100644
index 0000000000000000000000000000000000000000..0c41d7f1eed186f254150acaa6d9290b27478936
--- /dev/null
+++ b/configs/_base_/datasets/tiny_imagenet_bs64_pil_resize_autoaug.py
@@ -0,0 +1,68 @@
+# dataset settings
+dataset_type = 'CustomDataset'
+data_preprocessor = dict(
+ num_classes=200,
+ # RGB format normalization parameters
+ mean=[123.675, 116.28, 103.53],
+ std=[58.395, 57.12, 57.375],
+ # convert image from BGR to RGB
+ to_rgb=True,
+)
+
+bgr_mean = data_preprocessor['mean'][::-1]
+bgr_std = data_preprocessor['std'][::-1]
+
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='RandomResizedCrop',
+ scale=224,
+ backend='pillow',
+ interpolation='bicubic'),
+ dict(type='RandomFlip', prob=0.5, direction='horizontal'),
+ dict(
+ type='AutoAugment',
+ policies='imagenet',
+ hparams=dict(
+ pad_val=[round(x) for x in bgr_mean], interpolation='bicubic')),
+ dict(type='PackInputs'),
+]
+
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='ResizeEdge',
+ scale=256,
+ edge='short',
+ backend='pillow',
+ interpolation='bicubic'),
+ dict(type='CenterCrop', crop_size=224),
+ dict(type='PackInputs'),
+]
+
+train_dataloader = dict(
+ batch_size=64,
+ num_workers=5,
+ dataset=dict(
+ type=dataset_type,
+ data_root='data/imagenet',
+ data_prefix='train',
+ pipeline=train_pipeline),
+ sampler=dict(type='DefaultSampler', shuffle=True),
+)
+
+val_dataloader = dict(
+ batch_size=64,
+ num_workers=5,
+ dataset=dict(
+ type=dataset_type,
+ data_root='data/imagenet',
+ data_prefix='val',
+ pipeline=test_pipeline),
+ sampler=dict(type='DefaultSampler', shuffle=False),
+)
+val_evaluator = dict(type='Accuracy', topk=(1, 5))
+
+# If you want a standard test, please manually configure the test dataset
+test_dataloader = val_dataloader
+test_evaluator = val_evaluator
diff --git a/configs/_base_/datasets/tiny_imagenet_bs64_swin_224.py b/configs/_base_/datasets/tiny_imagenet_bs64_swin_224.py
new file mode 100644
index 0000000000000000000000000000000000000000..bddb78bf25273d6d244368d411c4e8fb9235b871
--- /dev/null
+++ b/configs/_base_/datasets/tiny_imagenet_bs64_swin_224.py
@@ -0,0 +1,80 @@
+# dataset settings
+dataset_type = 'CustomDataset'
+data_preprocessor = dict(
+ num_classes=200,
+ # RGB format normalization parameters
+ mean=[123.675, 116.28, 103.53],
+ std=[58.395, 57.12, 57.375],
+ # convert image from BGR to RGB
+ to_rgb=True,
+)
+
+bgr_mean = data_preprocessor['mean'][::-1]
+bgr_std = data_preprocessor['std'][::-1]
+
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='RandomResizedCrop',
+ scale=224,
+ backend='pillow',
+ interpolation='bicubic'),
+ dict(type='RandomFlip', prob=0.5, direction='horizontal'),
+ dict(
+ type='RandAugment',
+ policies='timm_increasing',
+ num_policies=2,
+ total_level=10,
+ magnitude_level=9,
+ magnitude_std=0.5,
+ hparams=dict(
+ pad_val=[round(x) for x in bgr_mean], interpolation='bicubic')),
+ dict(
+ type='RandomErasing',
+ erase_prob=0.25,
+ mode='rand',
+ min_area_ratio=0.02,
+ max_area_ratio=1 / 3,
+ fill_color=bgr_mean,
+ fill_std=bgr_std),
+ dict(type='PackInputs'),
+]
+
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='ResizeEdge',
+ scale=256,
+ edge='short',
+ backend='pillow',
+ interpolation='bicubic'),
+ dict(type='CenterCrop', crop_size=224),
+ dict(type='PackInputs'),
+]
+
+train_dataloader = dict(
+ batch_size=64,
+ num_workers=5,
+ dataset=dict(
+ type=dataset_type,
+ data_root='data/imagenet',
+ data_prefix='train',
+ pipeline=train_pipeline),
+ sampler=dict(type='DefaultSampler', shuffle=True),
+)
+
+val_dataloader = dict(
+ batch_size=64,
+ num_workers=5,
+ dataset=dict(
+ type=dataset_type,
+ data_root='data/imagenet',
+ data_prefix='val',
+ pipeline=test_pipeline),
+ sampler=dict(type='DefaultSampler', shuffle=False),
+)
+val_evaluator = dict(type='Accuracy', topk=(1, 5))
+
+# If you want a standard test, please manually configure the test dataset
+test_dataloader = val_dataloader
+test_evaluator = val_evaluator
diff --git a/configs/_base_/datasets/vizwiz.py b/configs/_base_/datasets/vizwiz.py
new file mode 100644
index 0000000000000000000000000000000000000000..bb7156c07030e9c031c8796c62267b7c4a8b2d7a
--- /dev/null
+++ b/configs/_base_/datasets/vizwiz.py
@@ -0,0 +1,80 @@
+# data settings
+
+data_preprocessor = dict(
+ mean=[122.770938, 116.7460125, 104.09373615],
+ std=[68.5005327, 66.6321579, 70.32316305],
+ to_rgb=True,
+)
+
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='RandomResizedCrop',
+ scale=384,
+ interpolation='bicubic',
+ backend='pillow'),
+ dict(
+ type='PackInputs',
+ algorithm_keys=['question', 'gt_answer', 'gt_answer_weight'],
+ meta_keys=['question_id', 'image_id'],
+ ),
+]
+
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='Resize',
+ scale=(480, 480),
+ interpolation='bicubic',
+ backend='pillow'),
+ dict(
+ type='CleanCaption',
+ keys=['question'],
+ ),
+ dict(
+ type='PackInputs',
+ algorithm_keys=['question', 'gt_answer', 'gt_answer_weight'],
+ meta_keys=['question_id', 'image_id'],
+ ),
+]
+
+train_dataloader = dict(
+ batch_size=16,
+ num_workers=8,
+ dataset=dict(
+ type='VizWiz',
+ data_root='data/vizwiz/Images',
+ data_prefix='',
+ ann_file='Annotations/train.json',
+ pipeline=train_pipeline),
+ sampler=dict(type='DefaultSampler', shuffle=True),
+ persistent_workers=True,
+ drop_last=True,
+)
+
+val_dataloader = dict(
+ batch_size=16,
+ num_workers=8,
+ dataset=dict(
+ type='VizWiz',
+ data_root='data/vizwiz/Images',
+ data_prefix='',
+ ann_file='Annotations/val.json',
+ pipeline=test_pipeline),
+ sampler=dict(type='DefaultSampler', shuffle=False),
+ persistent_workers=True,
+)
+val_evaluator = dict(type='VizWizAcc')
+
+test_dataloader = dict(
+ batch_size=16,
+ num_workers=8,
+ dataset=dict(
+ type='VizWiz',
+ data_root='data/vizwiz/Images',
+ data_prefix='',
+ ann_file='Annotations/test.json',
+ pipeline=test_pipeline),
+ sampler=dict(type='DefaultSampler', shuffle=False),
+)
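+# The VizWiz test split does not release answers; ReportVQA dumps the
+# predictions to a file for submission to the evaluation server.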
+test_evaluator = dict(type='ReportVQA', file_path='vqa_test.json')
diff --git a/configs/_base_/datasets/voc_bs16.py b/configs/_base_/datasets/voc_bs16.py
new file mode 100644
index 0000000000000000000000000000000000000000..cac2248cb6f0fc96a1e1407e06bba5fbc9e70a4b
--- /dev/null
+++ b/configs/_base_/datasets/voc_bs16.py
@@ -0,0 +1,65 @@
+# dataset settings
+dataset_type = 'VOC'
+data_preprocessor = dict(
+ num_classes=20,
+ # RGB format normalization parameters
+ mean=[123.675, 116.28, 103.53],
+ std=[58.395, 57.12, 57.375],
+ # convert image from BGR to RGB
+ to_rgb=True,
+ # generate onehot-format labels for multi-label classification.
+ to_onehot=True,
+)
+
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='RandomResizedCrop', scale=224),
+ dict(type='RandomFlip', prob=0.5, direction='horizontal'),
+ dict(type='PackInputs'),
+]
+
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='ResizeEdge', scale=256, edge='short'),
+ dict(type='CenterCrop', crop_size=224),
+ dict(
+ type='PackInputs',
+ # `gt_label_difficult` is needed for VOC evaluation
+ meta_keys=('sample_idx', 'img_path', 'ori_shape', 'img_shape',
+ 'scale_factor', 'flip', 'flip_direction',
+ 'gt_label_difficult')),
+]
+
+train_dataloader = dict(
+ batch_size=16,
+ num_workers=5,
+ dataset=dict(
+ type=dataset_type,
+ data_root='data/VOC2007',
+ split='trainval',
+ pipeline=train_pipeline),
+ sampler=dict(type='DefaultSampler', shuffle=True),
+)
+
+val_dataloader = dict(
+ batch_size=16,
+ num_workers=5,
+ dataset=dict(
+ type=dataset_type,
+ data_root='data/VOC2007',
+ split='test',
+ pipeline=test_pipeline),
+ sampler=dict(type='DefaultSampler', shuffle=False),
+)
+
+# calculate precision_recall_f1 and mAP
+val_evaluator = [
+ dict(type='VOCMultiLabelMetric'),
+ dict(type='VOCMultiLabelMetric', average='micro'),
+ dict(type='VOCAveragePrecision')
+]
+
+test_dataloader = val_dataloader
+test_evaluator = val_evaluator
diff --git a/configs/_base_/datasets/vsr.py b/configs/_base_/datasets/vsr.py
new file mode 100644
index 0000000000000000000000000000000000000000..0fa9b8992d0c453797b38add80dd6c92fbfa9227
--- /dev/null
+++ b/configs/_base_/datasets/vsr.py
@@ -0,0 +1,81 @@
+# data settings
+
+data_preprocessor = dict(
+ mean=[122.770938, 116.7460125, 104.09373615],
+ std=[68.5005327, 66.6321579, 70.32316305],
+ to_rgb=True,
+)
+
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='RandomResizedCrop',
+ scale=384,
+ interpolation='bicubic',
+ backend='pillow'),
+ dict(
+ type='PackInputs',
+ algorithm_keys=['question', 'gt_answer', 'gt_answer_weight'],
+ meta_keys=['question_id', 'image_id'],
+ ),
+]
+
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='Resize',
+ scale=(480, 480),
+ interpolation='bicubic',
+ backend='pillow'),
+ dict(
+ type='CleanCaption',
+ keys=['question'],
+ ),
+ dict(
+ type='PackInputs',
+ algorithm_keys=['question', 'gt_answer', 'gt_answer_weight'],
+ meta_keys=['question_id', 'image_id'],
+ ),
+]
+
+train_dataloader = dict(
+ batch_size=16,
+ num_workers=8,
+ dataset=dict(
+ type='VSR',
+ data_root='data/coco',
+ data_prefix='',
+ ann_file='annotations/train.json',
+ pipeline=train_pipeline),
+ sampler=dict(type='DefaultSampler', shuffle=True),
+ persistent_workers=True,
+ drop_last=True,
+)
+
+val_dataloader = dict(
+ batch_size=16,
+ num_workers=8,
+ dataset=dict(
+ type='VSR',
+ data_root='data/coco',
+ data_prefix='',
+ ann_file='annotations/val.json',
+ pipeline=test_pipeline),
+ sampler=dict(type='DefaultSampler', shuffle=False),
+ persistent_workers=True,
+)
+val_evaluator = dict(type='VSRAcc')
+
+test_dataloader = dict(
+ batch_size=16,
+ num_workers=8,
+ dataset=dict(
+ type='VSR',
+ data_root='data/coco',
+ data_prefix='',
+ ann_file='annotations/test.json',
+ pipeline=test_pipeline),
+ sampler=dict(type='DefaultSampler', shuffle=False),
+ persistent_workers=True,
+)
+test_evaluator = val_evaluator
diff --git a/configs/_base_/default_runtime.py b/configs/_base_/default_runtime.py
new file mode 100644
index 0000000000000000000000000000000000000000..3816d423fabab10d26b0abfea1f60eb270c1dc83
--- /dev/null
+++ b/configs/_base_/default_runtime.py
@@ -0,0 +1,51 @@
+# use the registries in mmpretrain by default
+default_scope = 'mmpretrain'
+
+# configure default hooks
+default_hooks = dict(
+ # record the time of every iteration.
+ timer=dict(type='IterTimerHook'),
+
+ # print log every 100 iterations.
+ logger=dict(type='LoggerHook', interval=100),
+
+ # enable the parameter scheduler.
+ param_scheduler=dict(type='ParamSchedulerHook'),
+
+ # save checkpoint per epoch.
+ checkpoint=dict(type='CheckpointHook', interval=1),
+
+ # set the sampler seed in distributed environments.
+ sampler_seed=dict(type='DistSamplerSeedHook'),
+
+ # visualization of validation results, set enable=True to turn it on.
+ visualization=dict(type='VisualizationHook', enable=False),
+)
+
+# configure environment
+env_cfg = dict(
+ # whether to enable cudnn benchmark
+ cudnn_benchmark=False,
+
+ # set multi-process parameters
+ mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
+
+ # set distributed parameters
+ dist_cfg=dict(backend='nccl'),
+)
+
+# set visualizer
+vis_backends = [dict(type='LocalVisBackend')]
+visualizer = dict(type='UniversalVisualizer', vis_backends=vis_backends)
+
+# set log level
+log_level = 'INFO'
+
+# which checkpoint to load from (None means not to load any)
+load_from = None
+
+# whether to resume training from the loaded checkpoint
+resume = False
+
+# use a random seed and disable `deterministic` by default
+randomness = dict(seed=None, deterministic=False)
diff --git a/configs/_base_/models/conformer/base-p16.py b/configs/_base_/models/conformer/base-p16.py
new file mode 100644
index 0000000000000000000000000000000000000000..959da5059a8f36c1076bf9875c51fd466fc96fa4
--- /dev/null
+++ b/configs/_base_/models/conformer/base-p16.py
@@ -0,0 +1,23 @@
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='Conformer', arch='base', drop_path_rate=0.1, init_cfg=None),
+ neck=None,
+ head=dict(
+ type='ConformerHead',
+ num_classes=1000,
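+ # the Conformer backbone has a convolution branch and a transformer branch,
+ # so the head takes two input channel numbers, one for each branch.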
+ in_channels=[1536, 576],
+ init_cfg=None,
+ loss=dict(
+ type='LabelSmoothLoss', label_smooth_val=0.1, mode='original'),
+ cal_acc=False),
+ init_cfg=[
+ dict(type='TruncNormal', layer='Linear', std=0.02, bias=0.),
+ dict(type='Constant', layer='LayerNorm', val=1., bias=0.)
+ ],
+ train_cfg=dict(augments=[
+ dict(type='Mixup', alpha=0.8),
+ dict(type='CutMix', alpha=1.0)
+ ]),
+)
diff --git a/configs/_base_/models/conformer/small-p16.py b/configs/_base_/models/conformer/small-p16.py
new file mode 100644
index 0000000000000000000000000000000000000000..2e4f9f80745af51538306bd8928082f3fd2e9997
--- /dev/null
+++ b/configs/_base_/models/conformer/small-p16.py
@@ -0,0 +1,23 @@
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='Conformer', arch='small', drop_path_rate=0.1, init_cfg=None),
+ neck=None,
+ head=dict(
+ type='ConformerHead',
+ num_classes=1000,
+ in_channels=[1024, 384],
+ init_cfg=None,
+ loss=dict(
+ type='LabelSmoothLoss', label_smooth_val=0.1, mode='original'),
+ cal_acc=False),
+ init_cfg=[
+ dict(type='TruncNormal', layer='Linear', std=0.02, bias=0.),
+ dict(type='Constant', layer='LayerNorm', val=1., bias=0.)
+ ],
+ train_cfg=dict(augments=[
+ dict(type='Mixup', alpha=0.8),
+ dict(type='CutMix', alpha=1.0)
+ ]),
+)
diff --git a/configs/_base_/models/conformer/small-p32.py b/configs/_base_/models/conformer/small-p32.py
new file mode 100644
index 0000000000000000000000000000000000000000..f73811fff492f3e1770e514335ccc71b2bd3caf6
--- /dev/null
+++ b/configs/_base_/models/conformer/small-p32.py
@@ -0,0 +1,27 @@
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='Conformer',
+ arch='small',
+ patch_size=32,
+ drop_path_rate=0.1,
+ init_cfg=None),
+ neck=None,
+ head=dict(
+ type='ConformerHead',
+ num_classes=1000,
+ in_channels=[1024, 384],
+ init_cfg=None,
+ loss=dict(
+ type='LabelSmoothLoss', label_smooth_val=0.1, mode='original'),
+ cal_acc=False),
+ init_cfg=[
+ dict(type='TruncNormal', layer='Linear', std=0.02, bias=0.),
+ dict(type='Constant', layer='LayerNorm', val=1., bias=0.)
+ ],
+ train_cfg=dict(augments=[
+ dict(type='Mixup', alpha=0.8),
+ dict(type='CutMix', alpha=1.0)
+ ]),
+)
diff --git a/configs/_base_/models/conformer/tiny-p16.py b/configs/_base_/models/conformer/tiny-p16.py
new file mode 100644
index 0000000000000000000000000000000000000000..fa9753b6fac957a0c8f9612bd0b9a693a3ecbf4e
--- /dev/null
+++ b/configs/_base_/models/conformer/tiny-p16.py
@@ -0,0 +1,23 @@
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='Conformer', arch='tiny', drop_path_rate=0.1, init_cfg=None),
+ neck=None,
+ head=dict(
+ type='ConformerHead',
+ num_classes=1000,
+ in_channels=[256, 384],
+ init_cfg=None,
+ loss=dict(
+ type='LabelSmoothLoss', label_smooth_val=0.1, mode='original'),
+ cal_acc=False),
+ init_cfg=[
+ dict(type='TruncNormal', layer='Linear', std=0.02, bias=0.),
+ dict(type='Constant', layer='LayerNorm', val=1., bias=0.)
+ ],
+ train_cfg=dict(augments=[
+ dict(type='Mixup', alpha=0.8),
+ dict(type='CutMix', alpha=1.0)
+ ]),
+)
diff --git a/configs/_base_/models/convmixer/convmixer-1024-20.py b/configs/_base_/models/convmixer/convmixer-1024-20.py
new file mode 100644
index 0000000000000000000000000000000000000000..a8f4d517e0d5e74c0d0412bb6e4f43b244761c03
--- /dev/null
+++ b/configs/_base_/models/convmixer/convmixer-1024-20.py
@@ -0,0 +1,11 @@
+# Model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(type='ConvMixer', arch='1024/20'),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=1024,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ ))
diff --git a/configs/_base_/models/convmixer/convmixer-1536-20.py b/configs/_base_/models/convmixer/convmixer-1536-20.py
new file mode 100644
index 0000000000000000000000000000000000000000..9ad8209bb4fc55665be36cdcd8102d854c533951
--- /dev/null
+++ b/configs/_base_/models/convmixer/convmixer-1536-20.py
@@ -0,0 +1,11 @@
+# Model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(type='ConvMixer', arch='1536/20'),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=1536,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ ))
diff --git a/configs/_base_/models/convmixer/convmixer-768-32.py b/configs/_base_/models/convmixer/convmixer-768-32.py
new file mode 100644
index 0000000000000000000000000000000000000000..1cba528b0edf9d394ae9730ecd51d41bbd314b38
--- /dev/null
+++ b/configs/_base_/models/convmixer/convmixer-768-32.py
@@ -0,0 +1,11 @@
+# Model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(type='ConvMixer', arch='768/32', act_cfg=dict(type='ReLU')),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=768,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ ))
diff --git a/configs/_base_/models/convnext/convnext-base.py b/configs/_base_/models/convnext/convnext-base.py
new file mode 100644
index 0000000000000000000000000000000000000000..aba6c19d1ac5039bab2363f80d500c81d4bb809b
--- /dev/null
+++ b/configs/_base_/models/convnext/convnext-base.py
@@ -0,0 +1,19 @@
+# Model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(type='ConvNeXt', arch='base', drop_path_rate=0.5),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=1024,
+ loss=dict(
+ type='LabelSmoothLoss', label_smooth_val=0.1, mode='original'),
+ init_cfg=None,
+ ),
+ init_cfg=dict(
+ type='TruncNormal', layer=['Conv2d', 'Linear'], std=.02, bias=0.),
+ train_cfg=dict(augments=[
+ dict(type='Mixup', alpha=0.8),
+ dict(type='CutMix', alpha=1.0),
+ ]),
+)
diff --git a/configs/_base_/models/convnext/convnext-large.py b/configs/_base_/models/convnext/convnext-large.py
new file mode 100644
index 0000000000000000000000000000000000000000..9bd4d9f68bd47b207de129ab169c2366156199b3
--- /dev/null
+++ b/configs/_base_/models/convnext/convnext-large.py
@@ -0,0 +1,19 @@
+# Model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(type='ConvNeXt', arch='large', drop_path_rate=0.5),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=1536,
+ loss=dict(
+ type='LabelSmoothLoss', label_smooth_val=0.1, mode='original'),
+ init_cfg=None,
+ ),
+ init_cfg=dict(
+ type='TruncNormal', layer=['Conv2d', 'Linear'], std=.02, bias=0.),
+ train_cfg=dict(augments=[
+ dict(type='Mixup', alpha=0.8),
+ dict(type='CutMix', alpha=1.0),
+ ]),
+)
diff --git a/configs/_base_/models/convnext/convnext-small.py b/configs/_base_/models/convnext/convnext-small.py
new file mode 100644
index 0000000000000000000000000000000000000000..aeedb6d22fc8f80fe6c5fb246df44c8a28c41854
--- /dev/null
+++ b/configs/_base_/models/convnext/convnext-small.py
@@ -0,0 +1,19 @@
+# Model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(type='ConvNeXt', arch='small', drop_path_rate=0.4),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=768,
+ loss=dict(
+ type='LabelSmoothLoss', label_smooth_val=0.1, mode='original'),
+ init_cfg=None,
+ ),
+ init_cfg=dict(
+ type='TruncNormal', layer=['Conv2d', 'Linear'], std=.02, bias=0.),
+ train_cfg=dict(augments=[
+ dict(type='Mixup', alpha=0.8),
+ dict(type='CutMix', alpha=1.0),
+ ]),
+)
diff --git a/configs/_base_/models/convnext/convnext-tiny.py b/configs/_base_/models/convnext/convnext-tiny.py
new file mode 100644
index 0000000000000000000000000000000000000000..05baba09eefe44196a54c112c5c785ff79a1b52b
--- /dev/null
+++ b/configs/_base_/models/convnext/convnext-tiny.py
@@ -0,0 +1,19 @@
+# Model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(type='ConvNeXt', arch='tiny', drop_path_rate=0.1),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=768,
+ loss=dict(
+ type='LabelSmoothLoss', label_smooth_val=0.1, mode='original'),
+ init_cfg=None,
+ ),
+ init_cfg=dict(
+ type='TruncNormal', layer=['Conv2d', 'Linear'], std=.02, bias=0.),
+ train_cfg=dict(augments=[
+ dict(type='Mixup', alpha=0.8),
+ dict(type='CutMix', alpha=1.0),
+ ]),
+)
diff --git a/configs/_base_/models/convnext/convnext-xlarge.py b/configs/_base_/models/convnext/convnext-xlarge.py
new file mode 100644
index 0000000000000000000000000000000000000000..7211b94f6cebe4c93d150dec276291f725f9f513
--- /dev/null
+++ b/configs/_base_/models/convnext/convnext-xlarge.py
@@ -0,0 +1,19 @@
+# Model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(type='ConvNeXt', arch='xlarge', drop_path_rate=0.5),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=2048,
+ loss=dict(
+ type='LabelSmoothLoss', label_smooth_val=0.1, mode='original'),
+ init_cfg=None,
+ ),
+ init_cfg=dict(
+ type='TruncNormal', layer=['Conv2d', 'Linear'], std=.02, bias=0.),
+ train_cfg=dict(augments=[
+ dict(type='Mixup', alpha=0.8),
+ dict(type='CutMix', alpha=1.0),
+ ]),
+)
diff --git a/configs/_base_/models/convnext_v2/atto.py b/configs/_base_/models/convnext_v2/atto.py
new file mode 100644
index 0000000000000000000000000000000000000000..557ce93fce2572fe2fd95db80da4556e0dd7810d
--- /dev/null
+++ b/configs/_base_/models/convnext_v2/atto.py
@@ -0,0 +1,20 @@
+# Model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='ConvNeXt',
+ arch='atto',
+ drop_path_rate=0.1,
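+ # ConvNeXt V2 disables layer scale and uses Global Response Normalization.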
+ layer_scale_init_value=0.,
+ use_grn=True,
+ ),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=320,
+ loss=dict(type='LabelSmoothLoss', label_smooth_val=0.2),
+ init_cfg=None,
+ ),
+ init_cfg=dict(
+ type='TruncNormal', layer=['Conv2d', 'Linear'], std=.02, bias=0.),
+)
diff --git a/configs/_base_/models/convnext_v2/base.py b/configs/_base_/models/convnext_v2/base.py
new file mode 100644
index 0000000000000000000000000000000000000000..1401ef75f96814d5db1f6a37aa8d8761ccfe1e39
--- /dev/null
+++ b/configs/_base_/models/convnext_v2/base.py
@@ -0,0 +1,24 @@
+# Model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='ConvNeXt',
+ arch='base',
+ drop_path_rate=0.1,
+ layer_scale_init_value=0.,
+ use_grn=True,
+ ),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=1024,
+ loss=dict(type='LabelSmoothLoss', label_smooth_val=0.1),
+ init_cfg=None,
+ ),
+ init_cfg=dict(
+ type='TruncNormal', layer=['Conv2d', 'Linear'], std=.02, bias=0.),
+ train_cfg=dict(augments=[
+ dict(type='Mixup', alpha=0.8),
+ dict(type='CutMix', alpha=1.0),
+ ]),
+)
diff --git a/configs/_base_/models/convnext_v2/femto.py b/configs/_base_/models/convnext_v2/femto.py
new file mode 100644
index 0000000000000000000000000000000000000000..d56a241a97820713618480bec0fe09f94ecb1cea
--- /dev/null
+++ b/configs/_base_/models/convnext_v2/femto.py
@@ -0,0 +1,20 @@
+# Model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='ConvNeXt',
+ arch='femto',
+ drop_path_rate=0.1,
+ layer_scale_init_value=0.,
+ use_grn=True,
+ ),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=384,
+ loss=dict(type='LabelSmoothLoss', label_smooth_val=0.1),
+ init_cfg=None,
+ ),
+ init_cfg=dict(
+ type='TruncNormal', layer=['Conv2d', 'Linear'], std=.02, bias=0.),
+)
diff --git a/configs/_base_/models/convnext_v2/huge.py b/configs/_base_/models/convnext_v2/huge.py
new file mode 100644
index 0000000000000000000000000000000000000000..54141dd5220fdd0f40ce21054890e86b19597aff
--- /dev/null
+++ b/configs/_base_/models/convnext_v2/huge.py
@@ -0,0 +1,24 @@
+# Model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='ConvNeXt',
+ arch='huge',
+ drop_path_rate=0.1,
+ layer_scale_init_value=0.,
+ use_grn=True,
+ ),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=2816,
+ loss=dict(type='LabelSmoothLoss', label_smooth_val=0.1),
+ init_cfg=None,
+ ),
+ init_cfg=dict(
+ type='TruncNormal', layer=['Conv2d', 'Linear'], std=.02, bias=0.),
+ train_cfg=dict(augments=[
+ dict(type='Mixup', alpha=0.8),
+ dict(type='CutMix', alpha=1.0),
+ ]),
+)
diff --git a/configs/_base_/models/convnext_v2/large.py b/configs/_base_/models/convnext_v2/large.py
new file mode 100644
index 0000000000000000000000000000000000000000..20237de2baaccd2779bcec45549ec5a294d8ba6b
--- /dev/null
+++ b/configs/_base_/models/convnext_v2/large.py
@@ -0,0 +1,24 @@
+# Model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='ConvNeXt',
+ arch='large',
+ drop_path_rate=0.1,
+ layer_scale_init_value=0.,
+ use_grn=True,
+ ),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=1536,
+ loss=dict(type='LabelSmoothLoss', label_smooth_val=0.1),
+ init_cfg=None,
+ ),
+ init_cfg=dict(
+ type='TruncNormal', layer=['Conv2d', 'Linear'], std=.02, bias=0.),
+ train_cfg=dict(augments=[
+ dict(type='Mixup', alpha=0.8),
+ dict(type='CutMix', alpha=1.0),
+ ]),
+)
diff --git a/configs/_base_/models/convnext_v2/nano.py b/configs/_base_/models/convnext_v2/nano.py
new file mode 100644
index 0000000000000000000000000000000000000000..05575d0e105da6880beafa08d1bdb0c608261a51
--- /dev/null
+++ b/configs/_base_/models/convnext_v2/nano.py
@@ -0,0 +1,20 @@
+# Model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='ConvNeXt',
+ arch='nano',
+ drop_path_rate=0.1,
+ layer_scale_init_value=0.,
+ use_grn=True,
+ ),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=640,
+ loss=dict(type='LabelSmoothLoss', label_smooth_val=0.2),
+ init_cfg=None,
+ ),
+ init_cfg=dict(
+ type='TruncNormal', layer=['Conv2d', 'Linear'], std=.02, bias=0.),
+)
diff --git a/configs/_base_/models/convnext_v2/pico.py b/configs/_base_/models/convnext_v2/pico.py
new file mode 100644
index 0000000000000000000000000000000000000000..6d50ba890069457bc512ac2d2da1038ee73cd065
--- /dev/null
+++ b/configs/_base_/models/convnext_v2/pico.py
@@ -0,0 +1,20 @@
+# Model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='ConvNeXt',
+ arch='pico',
+ drop_path_rate=0.1,
+ layer_scale_init_value=0.,
+ use_grn=True,
+ ),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=512,
+ loss=dict(type='LabelSmoothLoss', label_smooth_val=0.1),
+ init_cfg=None,
+ ),
+ init_cfg=dict(
+ type='TruncNormal', layer=['Conv2d', 'Linear'], std=.02, bias=0.),
+)
diff --git a/configs/_base_/models/convnext_v2/tiny.py b/configs/_base_/models/convnext_v2/tiny.py
new file mode 100644
index 0000000000000000000000000000000000000000..c9835ccdb47f8c976be9519160ba13f6f4a168f9
--- /dev/null
+++ b/configs/_base_/models/convnext_v2/tiny.py
@@ -0,0 +1,24 @@
+# Model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='ConvNeXt',
+ arch='tiny',
+ drop_path_rate=0.2,
+ layer_scale_init_value=0.,
+ use_grn=True,
+ ),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=768,
+ loss=dict(type='LabelSmoothLoss', label_smooth_val=0.2),
+ init_cfg=None,
+ ),
+ init_cfg=dict(
+ type='TruncNormal', layer=['Conv2d', 'Linear'], std=.02, bias=0.),
+ train_cfg=dict(augments=[
+ dict(type='Mixup', alpha=0.8),
+ dict(type='CutMix', alpha=1.0),
+ ]),
+)
diff --git a/configs/_base_/models/davit/davit-base.py b/configs/_base_/models/davit/davit-base.py
new file mode 100644
index 0000000000000000000000000000000000000000..0dbf07739ecc907e4a77d0cdbd9c21f4c8fbecf1
--- /dev/null
+++ b/configs/_base_/models/davit/davit-base.py
@@ -0,0 +1,16 @@
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='DaViT', arch='base', out_indices=(3, ), drop_path_rate=0.4),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=1024,
+ loss=dict(
+ type='LabelSmoothLoss', label_smooth_val=0.1, mode='original'),
+ ),
+ train_cfg=dict(augments=[
+ dict(type='Mixup', alpha=0.8),
+ dict(type='CutMix', alpha=1.0)
+ ]))
diff --git a/configs/_base_/models/davit/davit-small.py b/configs/_base_/models/davit/davit-small.py
new file mode 100644
index 0000000000000000000000000000000000000000..2fa0325552c2bc28f69263ba42547090b7a521fb
--- /dev/null
+++ b/configs/_base_/models/davit/davit-small.py
@@ -0,0 +1,16 @@
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='DaViT', arch='small', out_indices=(3, ), drop_path_rate=0.2),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=768,
+ loss=dict(
+ type='LabelSmoothLoss', label_smooth_val=0.1, mode='original'),
+ ),
+ train_cfg=dict(augments=[
+ dict(type='Mixup', alpha=0.8),
+ dict(type='CutMix', alpha=1.0)
+ ]))
diff --git a/configs/_base_/models/davit/davit-tiny.py b/configs/_base_/models/davit/davit-tiny.py
new file mode 100644
index 0000000000000000000000000000000000000000..29432d28bd09a613bf4eaabe4f8ef4d0d763a49d
--- /dev/null
+++ b/configs/_base_/models/davit/davit-tiny.py
@@ -0,0 +1,16 @@
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='DaViT', arch='t', out_indices=(3, ), drop_path_rate=0.1),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=768,
+ loss=dict(
+ type='LabelSmoothLoss', label_smooth_val=0.1, mode='original'),
+ ),
+ train_cfg=dict(augments=[
+ dict(type='Mixup', alpha=0.8),
+ dict(type='CutMix', alpha=1.0)
+ ]))
diff --git a/configs/_base_/models/deit3/deit3-base-p16-224.py b/configs/_base_/models/deit3/deit3-base-p16-224.py
new file mode 100644
index 0000000000000000000000000000000000000000..84cba1afadbf13ed78e5f3c2be112a70b5ba8be1
--- /dev/null
+++ b/configs/_base_/models/deit3/deit3-base-p16-224.py
@@ -0,0 +1,24 @@
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='DeiT3',
+ arch='b',
+ img_size=224,
+ patch_size=16,
+ drop_path_rate=0.2),
+ neck=None,
+ head=dict(
+ type='VisionTransformerClsHead',
+ num_classes=1000,
+ in_channels=768,
+ loss=dict(
+ type='LabelSmoothLoss', label_smooth_val=0.1, mode='original'),
+ ),
+ init_cfg=[
+ dict(type='TruncNormal', layer='Linear', std=.02),
+ dict(type='Constant', layer='LayerNorm', val=1., bias=0.),
+ ],
+ train_cfg=dict(augments=[
+ dict(type='Mixup', alpha=0.8),
+ dict(type='CutMix', alpha=1.0)
+ ]))
diff --git a/configs/_base_/models/deit3/deit3-base-p16-384.py b/configs/_base_/models/deit3/deit3-base-p16-384.py
new file mode 100644
index 0000000000000000000000000000000000000000..1c9f42bc3a3b69c5091c5a31c0d7a137fb944cf5
--- /dev/null
+++ b/configs/_base_/models/deit3/deit3-base-p16-384.py
@@ -0,0 +1,24 @@
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='DeiT3',
+ arch='b',
+ img_size=384,
+ patch_size=16,
+ drop_path_rate=0.15),
+ neck=None,
+ head=dict(
+ type='VisionTransformerClsHead',
+ num_classes=1000,
+ in_channels=768,
+ loss=dict(
+ type='LabelSmoothLoss', label_smooth_val=0.1, mode='original'),
+ ),
+ init_cfg=[
+ dict(type='TruncNormal', layer='Linear', std=.02),
+ dict(type='Constant', layer='LayerNorm', val=1., bias=0.),
+ ],
+ train_cfg=dict(augments=[
+ dict(type='Mixup', alpha=0.8),
+ dict(type='CutMix', alpha=1.0)
+ ]))
diff --git a/configs/_base_/models/deit3/deit3-huge-p14-224.py b/configs/_base_/models/deit3/deit3-huge-p14-224.py
new file mode 100644
index 0000000000000000000000000000000000000000..b7a69ce914fbc32b029cb1a891fb1cf49d4bfce0
--- /dev/null
+++ b/configs/_base_/models/deit3/deit3-huge-p14-224.py
@@ -0,0 +1,24 @@
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='DeiT3',
+ arch='h',
+ img_size=224,
+ patch_size=14,
+ drop_path_rate=0.55),
+ neck=None,
+ head=dict(
+ type='VisionTransformerClsHead',
+ num_classes=1000,
+ in_channels=1280,
+ loss=dict(
+ type='LabelSmoothLoss', label_smooth_val=0.1, mode='original'),
+ ),
+ init_cfg=[
+ dict(type='TruncNormal', layer='Linear', std=.02),
+ dict(type='Constant', layer='LayerNorm', val=1., bias=0.),
+ ],
+ train_cfg=dict(augments=[
+ dict(type='Mixup', alpha=0.8),
+ dict(type='CutMix', alpha=1.0)
+ ]))
diff --git a/configs/_base_/models/deit3/deit3-large-p16-224.py b/configs/_base_/models/deit3/deit3-large-p16-224.py
new file mode 100644
index 0000000000000000000000000000000000000000..96135c57879715a1de50efd8e6c28fc635eae1ff
--- /dev/null
+++ b/configs/_base_/models/deit3/deit3-large-p16-224.py
@@ -0,0 +1,24 @@
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='DeiT3',
+ arch='l',
+ img_size=224,
+ patch_size=16,
+ drop_path_rate=0.45),
+ neck=None,
+ head=dict(
+ type='VisionTransformerClsHead',
+ num_classes=1000,
+ in_channels=1024,
+ loss=dict(
+ type='LabelSmoothLoss', label_smooth_val=0.1, mode='original'),
+ ),
+ init_cfg=[
+ dict(type='TruncNormal', layer='Linear', std=.02),
+ dict(type='Constant', layer='LayerNorm', val=1., bias=0.),
+ ],
+ train_cfg=dict(augments=[
+ dict(type='Mixup', alpha=0.8),
+ dict(type='CutMix', alpha=1.0)
+ ]))
diff --git a/configs/_base_/models/deit3/deit3-large-p16-384.py b/configs/_base_/models/deit3/deit3-large-p16-384.py
new file mode 100644
index 0000000000000000000000000000000000000000..aa9326c17cd0b0e1d625270140a80f1bb92fc0bf
--- /dev/null
+++ b/configs/_base_/models/deit3/deit3-large-p16-384.py
@@ -0,0 +1,24 @@
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='DeiT3',
+ arch='l',
+ img_size=384,
+ patch_size=16,
+ drop_path_rate=0.4),
+ neck=None,
+ head=dict(
+ type='VisionTransformerClsHead',
+ num_classes=1000,
+ in_channels=1024,
+ loss=dict(
+ type='LabelSmoothLoss', label_smooth_val=0.1, mode='original'),
+ ),
+ init_cfg=[
+ dict(type='TruncNormal', layer='Linear', std=.02),
+ dict(type='Constant', layer='LayerNorm', val=1., bias=0.),
+ ],
+ train_cfg=dict(augments=[
+ dict(type='Mixup', alpha=0.8),
+ dict(type='CutMix', alpha=1.0)
+ ]))
diff --git a/configs/_base_/models/deit3/deit3-medium-p16-224.py b/configs/_base_/models/deit3/deit3-medium-p16-224.py
new file mode 100644
index 0000000000000000000000000000000000000000..84233e5cfde13cd0f142b49f64c3b3ec65ff4f68
--- /dev/null
+++ b/configs/_base_/models/deit3/deit3-medium-p16-224.py
@@ -0,0 +1,24 @@
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='DeiT3',
+ arch='m',
+ img_size=224,
+ patch_size=16,
+ drop_path_rate=0.2),
+ neck=None,
+ head=dict(
+ type='VisionTransformerClsHead',
+ num_classes=1000,
+ in_channels=512,
+ loss=dict(
+ type='LabelSmoothLoss', label_smooth_val=0.1, mode='original'),
+ ),
+ init_cfg=[
+ dict(type='TruncNormal', layer='Linear', std=.02),
+ dict(type='Constant', layer='LayerNorm', val=1., bias=0.),
+ ],
+ train_cfg=dict(augments=[
+ dict(type='Mixup', alpha=0.8),
+ dict(type='CutMix', alpha=1.0)
+ ]))
diff --git a/configs/_base_/models/deit3/deit3-small-p16-224.py b/configs/_base_/models/deit3/deit3-small-p16-224.py
new file mode 100644
index 0000000000000000000000000000000000000000..af29d32bc799ebdff5a9724fe5555261ba0b584c
--- /dev/null
+++ b/configs/_base_/models/deit3/deit3-small-p16-224.py
@@ -0,0 +1,24 @@
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='DeiT3',
+ arch='s',
+ img_size=224,
+ patch_size=16,
+ drop_path_rate=0.05),
+ neck=None,
+ head=dict(
+ type='VisionTransformerClsHead',
+ num_classes=1000,
+ in_channels=384,
+ loss=dict(
+ type='LabelSmoothLoss', label_smooth_val=0.1, mode='original'),
+ ),
+ init_cfg=[
+ dict(type='TruncNormal', layer='Linear', std=.02),
+ dict(type='Constant', layer='LayerNorm', val=1., bias=0.),
+ ],
+ train_cfg=dict(augments=[
+ dict(type='Mixup', alpha=0.8),
+ dict(type='CutMix', alpha=1.0)
+ ]))
diff --git a/configs/_base_/models/deit3/deit3-small-p16-384.py b/configs/_base_/models/deit3/deit3-small-p16-384.py
new file mode 100644
index 0000000000000000000000000000000000000000..bebb4845e8c3a47e1d944702c49357d6d8aa4cd6
--- /dev/null
+++ b/configs/_base_/models/deit3/deit3-small-p16-384.py
@@ -0,0 +1,24 @@
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='DeiT3',
+ arch='s',
+ img_size=384,
+ patch_size=16,
+ drop_path_rate=0.0),
+ neck=None,
+ head=dict(
+ type='VisionTransformerClsHead',
+ num_classes=1000,
+ in_channels=384,
+ loss=dict(
+ type='LabelSmoothLoss', label_smooth_val=0.1, mode='original'),
+ ),
+ init_cfg=[
+ dict(type='TruncNormal', layer='Linear', std=.02),
+ dict(type='Constant', layer='LayerNorm', val=1., bias=0.),
+ ],
+ train_cfg=dict(augments=[
+ dict(type='Mixup', alpha=0.8),
+ dict(type='CutMix', alpha=1.0)
+ ]))
diff --git a/configs/_base_/models/densenet/densenet121.py b/configs/_base_/models/densenet/densenet121.py
new file mode 100644
index 0000000000000000000000000000000000000000..0a14d302584a910e87ccf598e9434bd0685207aa
--- /dev/null
+++ b/configs/_base_/models/densenet/densenet121.py
@@ -0,0 +1,11 @@
+# Model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(type='DenseNet', arch='121'),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=1024,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ ))
diff --git a/configs/_base_/models/densenet/densenet161.py b/configs/_base_/models/densenet/densenet161.py
new file mode 100644
index 0000000000000000000000000000000000000000..61a0d838806267a5c987fa30eeb6363f23387ef3
--- /dev/null
+++ b/configs/_base_/models/densenet/densenet161.py
@@ -0,0 +1,11 @@
+# Model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(type='DenseNet', arch='161'),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=2208,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ ))
diff --git a/configs/_base_/models/densenet/densenet169.py b/configs/_base_/models/densenet/densenet169.py
new file mode 100644
index 0000000000000000000000000000000000000000..779ea1709256f8c001adaa3c73155c36d3363d71
--- /dev/null
+++ b/configs/_base_/models/densenet/densenet169.py
@@ -0,0 +1,11 @@
+# Model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(type='DenseNet', arch='169'),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=1664,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ ))
diff --git a/configs/_base_/models/densenet/densenet201.py b/configs/_base_/models/densenet/densenet201.py
new file mode 100644
index 0000000000000000000000000000000000000000..2909af0d36c656c1868ff38e72981dc9dafeaa2f
--- /dev/null
+++ b/configs/_base_/models/densenet/densenet201.py
@@ -0,0 +1,11 @@
+# Model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(type='DenseNet', arch='201'),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=1920,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ ))
diff --git a/configs/_base_/models/edgenext/edgenext-base.py b/configs/_base_/models/edgenext/edgenext-base.py
new file mode 100644
index 0000000000000000000000000000000000000000..378397298ed9d51241ad737d65b05f151ac69393
--- /dev/null
+++ b/configs/_base_/models/edgenext/edgenext-base.py
@@ -0,0 +1,23 @@
+# Model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='EdgeNeXt',
+ arch='base',
+ out_indices=(3, ),
+ drop_path_rate=0.1,
+ gap_before_final_norm=True,
+ init_cfg=[
+ dict(
+ type='TruncNormal',
+ layer=['Conv2d', 'Linear'],
+ std=.02,
+ bias=0.),
+ dict(type='Constant', layer=['LayerNorm'], val=1., bias=0.),
+ ]),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=584,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ ))
diff --git a/configs/_base_/models/edgenext/edgenext-small.py b/configs/_base_/models/edgenext/edgenext-small.py
new file mode 100644
index 0000000000000000000000000000000000000000..e1f7e1728a2f5cb895600aa0d81eeb5734dffec0
--- /dev/null
+++ b/configs/_base_/models/edgenext/edgenext-small.py
@@ -0,0 +1,23 @@
+# Model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='EdgeNeXt',
+ arch='small',
+ out_indices=(3, ),
+ drop_path_rate=0.1,
+ gap_before_final_norm=True,
+ init_cfg=[
+ dict(
+ type='TruncNormal',
+ layer=['Conv2d', 'Linear'],
+ std=.02,
+ bias=0.),
+ dict(type='Constant', layer=['LayerNorm'], val=1., bias=0.),
+ ]),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=304,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ ))
diff --git a/configs/_base_/models/edgenext/edgenext-xsmall.py b/configs/_base_/models/edgenext/edgenext-xsmall.py
new file mode 100644
index 0000000000000000000000000000000000000000..69c7d0d6a6ec9d09df03c007cd3fffa93165f5cb
--- /dev/null
+++ b/configs/_base_/models/edgenext/edgenext-xsmall.py
@@ -0,0 +1,23 @@
+# Model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='EdgeNeXt',
+ arch='xsmall',
+ out_indices=(3, ),
+ drop_path_rate=0.1,
+ gap_before_final_norm=True,
+ init_cfg=[
+ dict(
+ type='TruncNormal',
+ layer=['Conv2d', 'Linear'],
+ std=.02,
+ bias=0.),
+ dict(type='Constant', layer=['LayerNorm'], val=1., bias=0.),
+ ]),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=192,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ ))
diff --git a/configs/_base_/models/edgenext/edgenext-xxsmall.py b/configs/_base_/models/edgenext/edgenext-xxsmall.py
new file mode 100644
index 0000000000000000000000000000000000000000..fb6881951fae8c01c2a4ea78c3d61e7c6a900f24
--- /dev/null
+++ b/configs/_base_/models/edgenext/edgenext-xxsmall.py
@@ -0,0 +1,23 @@
+# Model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='EdgeNeXt',
+ arch='xxsmall',
+ out_indices=(3, ),
+ drop_path_rate=0.1,
+ gap_before_final_norm=True,
+ init_cfg=[
+ dict(
+ type='TruncNormal',
+ layer=['Conv2d', 'Linear'],
+ std=.02,
+ bias=0.),
+ dict(type='Constant', layer=['LayerNorm'], val=1., bias=0.),
+ ]),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=168,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ ))
diff --git a/configs/_base_/models/efficientformer-l1.py b/configs/_base_/models/efficientformer-l1.py
new file mode 100644
index 0000000000000000000000000000000000000000..37dc62cd235ee5a3f0257a24c54c8eb4fc797159
--- /dev/null
+++ b/configs/_base_/models/efficientformer-l1.py
@@ -0,0 +1,18 @@
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='EfficientFormer',
+ arch='l1',
+ drop_path_rate=0,
+ init_cfg=[
+ dict(
+ type='TruncNormal',
+ layer=['Conv2d', 'Linear'],
+ std=.02,
+ bias=0.),
+ dict(type='Constant', layer=['GroupNorm'], val=1., bias=0.),
+ dict(type='Constant', layer=['LayerScale'], val=1e-5)
+ ]),
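+ # the last EfficientFormer stage outputs 1D token features, so pool with dim=1.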
+ neck=dict(type='GlobalAveragePooling', dim=1),
+ head=dict(
+ type='EfficientFormerClsHead', in_channels=448, num_classes=1000))
diff --git a/configs/_base_/models/efficientnet_b0.py b/configs/_base_/models/efficientnet_b0.py
new file mode 100644
index 0000000000000000000000000000000000000000..d9ba685306c9e411a69887a2a301808cbaa104cb
--- /dev/null
+++ b/configs/_base_/models/efficientnet_b0.py
@@ -0,0 +1,12 @@
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(type='EfficientNet', arch='b0'),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=1280,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ topk=(1, 5),
+ ))
diff --git a/configs/_base_/models/efficientnet_b1.py b/configs/_base_/models/efficientnet_b1.py
new file mode 100644
index 0000000000000000000000000000000000000000..63e15c88b2f7e1d1c788811741ff26bf5f35601f
--- /dev/null
+++ b/configs/_base_/models/efficientnet_b1.py
@@ -0,0 +1,12 @@
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(type='EfficientNet', arch='b1'),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=1280,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ topk=(1, 5),
+ ))
diff --git a/configs/_base_/models/efficientnet_b2.py b/configs/_base_/models/efficientnet_b2.py
new file mode 100644
index 0000000000000000000000000000000000000000..5edcfa5d5b680ec41567e531e0b7a587e160c8af
--- /dev/null
+++ b/configs/_base_/models/efficientnet_b2.py
@@ -0,0 +1,12 @@
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(type='EfficientNet', arch='b2'),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=1408,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ topk=(1, 5),
+ ))
diff --git a/configs/_base_/models/efficientnet_b3.py b/configs/_base_/models/efficientnet_b3.py
new file mode 100644
index 0000000000000000000000000000000000000000..c7c6d6d899ecb910a37cbd3818f8c79c27db87e9
--- /dev/null
+++ b/configs/_base_/models/efficientnet_b3.py
@@ -0,0 +1,12 @@
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(type='EfficientNet', arch='b3'),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=1536,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ topk=(1, 5),
+ ))
diff --git a/configs/_base_/models/efficientnet_b4.py b/configs/_base_/models/efficientnet_b4.py
new file mode 100644
index 0000000000000000000000000000000000000000..06840ed559cc14ae47919f7cce67d635173e841d
--- /dev/null
+++ b/configs/_base_/models/efficientnet_b4.py
@@ -0,0 +1,12 @@
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(type='EfficientNet', arch='b4'),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=1792,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ topk=(1, 5),
+ ))
diff --git a/configs/_base_/models/efficientnet_b5.py b/configs/_base_/models/efficientnet_b5.py
new file mode 100644
index 0000000000000000000000000000000000000000..a86eebd19042eb36534ef3f42cc16bb32e88fb66
--- /dev/null
+++ b/configs/_base_/models/efficientnet_b5.py
@@ -0,0 +1,12 @@
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(type='EfficientNet', arch='b5'),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=2048,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ topk=(1, 5),
+ ))
diff --git a/configs/_base_/models/efficientnet_b6.py b/configs/_base_/models/efficientnet_b6.py
new file mode 100644
index 0000000000000000000000000000000000000000..4eada1d32511371bcb11c636b3aae9dc4733d379
--- /dev/null
+++ b/configs/_base_/models/efficientnet_b6.py
@@ -0,0 +1,12 @@
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(type='EfficientNet', arch='b6'),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=2304,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ topk=(1, 5),
+ ))
diff --git a/configs/_base_/models/efficientnet_b7.py b/configs/_base_/models/efficientnet_b7.py
new file mode 100644
index 0000000000000000000000000000000000000000..1d84ba427f42a186f376d829189461536e7ee383
--- /dev/null
+++ b/configs/_base_/models/efficientnet_b7.py
@@ -0,0 +1,12 @@
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(type='EfficientNet', arch='b7'),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=2560,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ topk=(1, 5),
+ ))
diff --git a/configs/_base_/models/efficientnet_b8.py b/configs/_base_/models/efficientnet_b8.py
new file mode 100644
index 0000000000000000000000000000000000000000..c9500644dae4a3240c5ecfa02f90deb8fde4e3de
--- /dev/null
+++ b/configs/_base_/models/efficientnet_b8.py
@@ -0,0 +1,12 @@
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(type='EfficientNet', arch='b8'),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=2816,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ topk=(1, 5),
+ ))
diff --git a/configs/_base_/models/efficientnet_em.py b/configs/_base_/models/efficientnet_em.py
new file mode 100644
index 0000000000000000000000000000000000000000..abecdbeef6c3791f902b6bd13fbceb28c3ac8942
--- /dev/null
+++ b/configs/_base_/models/efficientnet_em.py
@@ -0,0 +1,13 @@
+# model settings
+model = dict(
+ type='ImageClassifier',
+ # `em` means EfficientNet-EdgeTPU-M arch
+ backbone=dict(type='EfficientNet', arch='em', act_cfg=dict(type='ReLU')),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=1280,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ topk=(1, 5),
+ ))
diff --git a/configs/_base_/models/efficientnet_es.py b/configs/_base_/models/efficientnet_es.py
new file mode 100644
index 0000000000000000000000000000000000000000..911ba4a18261decd3d17e8962501083e1f1ea550
--- /dev/null
+++ b/configs/_base_/models/efficientnet_es.py
@@ -0,0 +1,13 @@
+# model settings
+model = dict(
+ type='ImageClassifier',
+ # `es` means EfficientNet-EdgeTPU-S arch
+ backbone=dict(type='EfficientNet', arch='es', act_cfg=dict(type='ReLU')),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=1280,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ topk=(1, 5),
+ ))
diff --git a/configs/_base_/models/efficientnet_l2.py b/configs/_base_/models/efficientnet_l2.py
new file mode 100644
index 0000000000000000000000000000000000000000..4219c87a81a93c50296cfebed8f20b9bbd2a4c13
--- /dev/null
+++ b/configs/_base_/models/efficientnet_l2.py
@@ -0,0 +1,12 @@
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(type='EfficientNet', arch='l2'),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=5504,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ topk=(1, 5),
+ ))
diff --git a/configs/_base_/models/efficientnet_v2/efficientnetv2_b0.py b/configs/_base_/models/efficientnet_v2/efficientnetv2_b0.py
new file mode 100644
index 0000000000000000000000000000000000000000..d42e32905ed9d18ab572bfe1e90c7161f941a34f
--- /dev/null
+++ b/configs/_base_/models/efficientnet_v2/efficientnetv2_b0.py
@@ -0,0 +1,12 @@
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(type='EfficientNetV2', arch='b0'),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=1280,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ topk=(1, 5),
+ ))
diff --git a/configs/_base_/models/efficientnet_v2/efficientnetv2_b1.py b/configs/_base_/models/efficientnet_v2/efficientnetv2_b1.py
new file mode 100644
index 0000000000000000000000000000000000000000..10736fc504637b07fe362e27c5e86ea73990217a
--- /dev/null
+++ b/configs/_base_/models/efficientnet_v2/efficientnetv2_b1.py
@@ -0,0 +1,12 @@
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(type='EfficientNetV2', arch='b1'),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=1280,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ topk=(1, 5),
+ ))
diff --git a/configs/_base_/models/efficientnet_v2/efficientnetv2_b2.py b/configs/_base_/models/efficientnet_v2/efficientnetv2_b2.py
new file mode 100644
index 0000000000000000000000000000000000000000..61f477120e031cd8cf46340bdbd3c687ade2a035
--- /dev/null
+++ b/configs/_base_/models/efficientnet_v2/efficientnetv2_b2.py
@@ -0,0 +1,12 @@
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(type='EfficientNetV2', arch='b2'),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=1408,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ topk=(1, 5),
+ ))
diff --git a/configs/_base_/models/efficientnet_v2/efficientnetv2_b3.py b/configs/_base_/models/efficientnet_v2/efficientnetv2_b3.py
new file mode 100644
index 0000000000000000000000000000000000000000..14e523fd2e4180e960aa8a3282e56f6604c38a47
--- /dev/null
+++ b/configs/_base_/models/efficientnet_v2/efficientnetv2_b3.py
@@ -0,0 +1,12 @@
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(type='EfficientNetV2', arch='b3'),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=1536,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ topk=(1, 5),
+ ))
diff --git a/configs/_base_/models/efficientnet_v2/efficientnetv2_l.py b/configs/_base_/models/efficientnet_v2/efficientnetv2_l.py
new file mode 100644
index 0000000000000000000000000000000000000000..456467d6fa076db11b009fca875e231569e05288
--- /dev/null
+++ b/configs/_base_/models/efficientnet_v2/efficientnetv2_l.py
@@ -0,0 +1,12 @@
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(type='EfficientNetV2', arch='l'),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=1280,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ topk=(1, 5),
+ ))
diff --git a/configs/_base_/models/efficientnet_v2/efficientnetv2_m.py b/configs/_base_/models/efficientnet_v2/efficientnetv2_m.py
new file mode 100644
index 0000000000000000000000000000000000000000..8e4d303f624d3375416b7c41c59a68a1a64e4a19
--- /dev/null
+++ b/configs/_base_/models/efficientnet_v2/efficientnetv2_m.py
@@ -0,0 +1,12 @@
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(type='EfficientNetV2', arch='m'),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=1280,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ topk=(1, 5),
+ ))
diff --git a/configs/_base_/models/efficientnet_v2/efficientnetv2_s.py b/configs/_base_/models/efficientnet_v2/efficientnetv2_s.py
new file mode 100644
index 0000000000000000000000000000000000000000..866648223c79aac1ca8519a1d18b167b7ac474ec
--- /dev/null
+++ b/configs/_base_/models/efficientnet_v2/efficientnetv2_s.py
@@ -0,0 +1,12 @@
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(type='EfficientNetV2', arch='s'),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=1280,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ topk=(1, 5),
+ ))
diff --git a/configs/_base_/models/efficientnet_v2/efficientnetv2_xl.py b/configs/_base_/models/efficientnet_v2/efficientnetv2_xl.py
new file mode 100644
index 0000000000000000000000000000000000000000..2216c9daa7d5e5e11084320b3aeab6a388588f40
--- /dev/null
+++ b/configs/_base_/models/efficientnet_v2/efficientnetv2_xl.py
@@ -0,0 +1,12 @@
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(type='EfficientNetV2', arch='xl'),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=1280,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ topk=(1, 5),
+ ))
diff --git a/configs/_base_/models/eva/eva-g.py b/configs/_base_/models/eva/eva-g.py
new file mode 100644
index 0000000000000000000000000000000000000000..17bc84ad8bd2ac5599f26351b5fb5ca3fb8ec8bc
--- /dev/null
+++ b/configs/_base_/models/eva/eva-g.py
@@ -0,0 +1,29 @@
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
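+ # EVA is implemented on top of the BEiT ViT backbone; 'eva-g' is the giant arch.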
+ type='BEiTViT',
+ arch='eva-g',
+ img_size=224,
+ patch_size=14,
+ layer_scale_init_value=0.0,
+ out_type='avg_featmap',
+ use_abs_pos_emb=True,
+ use_rel_pos_bias=False,
+ use_shared_rel_pos_bias=False,
+ ),
+ neck=None,
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=1408,
+ loss=dict(
+ type='LabelSmoothLoss', label_smooth_val=0.1, mode='original'),
+ ),
+ init_cfg=[
+ dict(type='TruncNormal', layer='Linear', std=.02),
+ dict(type='Constant', layer='LayerNorm', val=1., bias=0.),
+ ],
+ train_cfg=dict(augments=[
+ dict(type='Mixup', alpha=0.8),
+ dict(type='CutMix', alpha=1.0)
+ ]))
diff --git a/configs/_base_/models/eva/eva-l.py b/configs/_base_/models/eva/eva-l.py
new file mode 100644
index 0000000000000000000000000000000000000000..9b08e4b1e1881b706848c121ceb3b4d23cfae34a
--- /dev/null
+++ b/configs/_base_/models/eva/eva-l.py
@@ -0,0 +1,30 @@
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='BEiTViT',
+ arch='l',
+ img_size=224,
+ patch_size=14,
+ layer_scale_init_value=0.0,
+ out_type='avg_featmap',
+ use_abs_pos_emb=True,
+ use_rel_pos_bias=False,
+ use_shared_rel_pos_bias=False,
+ layer_cfgs=dict(bias=True),
+ ),
+ neck=None,
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=1024,
+ loss=dict(
+ type='LabelSmoothLoss', label_smooth_val=0.1, mode='original'),
+ ),
+ init_cfg=[
+ dict(type='TruncNormal', layer='Linear', std=.02),
+ dict(type='Constant', layer='LayerNorm', val=1., bias=0.),
+ ],
+ train_cfg=dict(augments=[
+ dict(type='Mixup', alpha=0.8),
+ dict(type='CutMix', alpha=1.0)
+ ]))
diff --git a/configs/_base_/models/hivit/base_224.py b/configs/_base_/models/hivit/base_224.py
new file mode 100644
index 0000000000000000000000000000000000000000..a87a68cf6f03e3e794361324fe5158b6a7dc5faa
--- /dev/null
+++ b/configs/_base_/models/hivit/base_224.py
@@ -0,0 +1,28 @@
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='HiViT',
+ arch='base',
+ img_size=224,
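+ # enable absolute (ape) and relative (rpe) position embeddings.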
+ ape=True,
+ rpe=True,
+ drop_path_rate=0.5),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=512,
+ init_cfg=None, # suppress the default init_cfg of LinearClsHead.
+ loss=dict(
+ type='LabelSmoothLoss', label_smooth_val=0.1, mode='original'),
+ cal_acc=False),
+ init_cfg=[
+ dict(type='TruncNormal', layer='Linear', std=0.02, bias=0.),
+ dict(type='Constant', layer='LayerNorm', val=1., bias=0.)
+ ],
+ train_cfg=dict(augments=[
+ dict(type='Mixup', alpha=0.8),
+ dict(type='CutMix', alpha=1.0)
+ ]),
+)
diff --git a/configs/_base_/models/hivit/small_224.py b/configs/_base_/models/hivit/small_224.py
new file mode 100644
index 0000000000000000000000000000000000000000..333b2461d3ef681dd24f367f18e38f2cc87dd2de
--- /dev/null
+++ b/configs/_base_/models/hivit/small_224.py
@@ -0,0 +1,28 @@
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='HiViT',
+ arch='small',
+ img_size=224,
+ ape=True,
+ rpe=True,
+ drop_path_rate=0.3),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=384,
+ init_cfg=None, # suppress the default init_cfg of LinearClsHead.
+ loss=dict(
+ type='LabelSmoothLoss', label_smooth_val=0.1, mode='original'),
+ cal_acc=False),
+ init_cfg=[
+ dict(type='TruncNormal', layer='Linear', std=0.02, bias=0.),
+ dict(type='Constant', layer='LayerNorm', val=1., bias=0.)
+ ],
+ train_cfg=dict(augments=[
+ dict(type='Mixup', alpha=0.8),
+ dict(type='CutMix', alpha=1.0)
+ ]),
+)
diff --git a/configs/_base_/models/hivit/tiny_224.py b/configs/_base_/models/hivit/tiny_224.py
new file mode 100644
index 0000000000000000000000000000000000000000..b3e2fdb3ce64aa8cfe42fb0b923d34fcdbb0524f
--- /dev/null
+++ b/configs/_base_/models/hivit/tiny_224.py
@@ -0,0 +1,28 @@
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='HiViT',
+ arch='tiny',
+ img_size=224,
+ ape=True,
+ rpe=True,
+ drop_path_rate=0.05),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=384,
+ init_cfg=None, # suppress the default init_cfg of LinearClsHead.
+ loss=dict(
+ type='LabelSmoothLoss', label_smooth_val=0.1, mode='original'),
+ cal_acc=False),
+ init_cfg=[
+ dict(type='TruncNormal', layer='Linear', std=0.02, bias=0.),
+ dict(type='Constant', layer='LayerNorm', val=1., bias=0.)
+ ],
+ train_cfg=dict(augments=[
+ dict(type='Mixup', alpha=0.8),
+ dict(type='CutMix', alpha=1.0)
+ ]),
+)
diff --git a/configs/_base_/models/hornet/hornet-base-gf.py b/configs/_base_/models/hornet/hornet-base-gf.py
new file mode 100644
index 0000000000000000000000000000000000000000..b6924f96265cda310a38765fa460ad685d9d01b7
--- /dev/null
+++ b/configs/_base_/models/hornet/hornet-base-gf.py
@@ -0,0 +1,20 @@
+model = dict(
+ type='ImageClassifier',
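+ # the '-gf' archs use the Global Filter (FFT-based) variant of gnConv.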
+ backbone=dict(type='HorNet', arch='base-gf', drop_path_rate=0.5),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=1024,
+ init_cfg=None, # suppress the default init_cfg of LinearClsHead.
+ loss=dict(
+ type='LabelSmoothLoss', label_smooth_val=0.1, mode='original'),
+ cal_acc=False),
+ init_cfg=[
+ dict(type='TruncNormal', layer='Linear', std=0.02, bias=0.),
+ dict(type='Constant', layer='LayerNorm', val=1., bias=0.),
+ dict(type='Constant', layer=['LayerScale'], val=1e-6)
+ ],
+ train_cfg=dict(augments=[
+ dict(type='Mixup', alpha=0.8),
+ dict(type='CutMix', alpha=1.0)
+ ]))
diff --git a/configs/_base_/models/hornet/hornet-base.py b/configs/_base_/models/hornet/hornet-base.py
new file mode 100644
index 0000000000000000000000000000000000000000..904379ab5f258fa366d75166e7446fccecf0bc2c
--- /dev/null
+++ b/configs/_base_/models/hornet/hornet-base.py
@@ -0,0 +1,21 @@
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(type='HorNet', arch='base', drop_path_rate=0.5),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=1024,
+ init_cfg=None, # suppress the default init_cfg of LinearClsHead.
+ loss=dict(
+ type='LabelSmoothLoss', label_smooth_val=0.1, mode='original'),
+ cal_acc=False),
+ init_cfg=[
+ dict(type='TruncNormal', layer='Linear', std=0.02, bias=0.),
+ dict(type='Constant', layer='LayerNorm', val=1., bias=0.),
+ dict(type='Constant', layer=['LayerScale'], val=1e-6)
+ ],
+ train_cfg=dict(augments=[
+ dict(type='Mixup', alpha=0.8),
+ dict(type='CutMix', alpha=1.0)
+ ]))
diff --git a/configs/_base_/models/hornet/hornet-large-gf.py b/configs/_base_/models/hornet/hornet-large-gf.py
new file mode 100644
index 0000000000000000000000000000000000000000..1607ba2208415699697f8ada17941cc75a6270a9
--- /dev/null
+++ b/configs/_base_/models/hornet/hornet-large-gf.py
@@ -0,0 +1,21 @@
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(type='HorNet', arch='large-gf', drop_path_rate=0.2),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=1536,
+ init_cfg=None, # suppress the default init_cfg of LinearClsHead.
+ loss=dict(
+ type='LabelSmoothLoss', label_smooth_val=0.1, mode='original'),
+ cal_acc=False),
+ init_cfg=[
+ dict(type='TruncNormal', layer='Linear', std=0.02, bias=0.),
+ dict(type='Constant', layer='LayerNorm', val=1., bias=0.),
+ dict(type='Constant', layer=['LayerScale'], val=1e-6)
+ ],
+ train_cfg=dict(augments=[
+ dict(type='Mixup', alpha=0.8),
+ dict(type='CutMix', alpha=1.0)
+ ]))
diff --git a/configs/_base_/models/hornet/hornet-large-gf384.py b/configs/_base_/models/hornet/hornet-large-gf384.py
new file mode 100644
index 0000000000000000000000000000000000000000..fbb547873ed047adaed448fb1d443b4de8750ea4
--- /dev/null
+++ b/configs/_base_/models/hornet/hornet-large-gf384.py
@@ -0,0 +1,17 @@
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(type='HorNet', arch='large-gf384', drop_path_rate=0.4),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=1536,
+ init_cfg=None, # suppress the default init_cfg of LinearClsHead.
+ loss=dict(
+ type='LabelSmoothLoss', label_smooth_val=0.1, mode='original'),
+ cal_acc=False),
+ init_cfg=[
+ dict(type='TruncNormal', layer='Linear', std=0.02, bias=0.),
+ dict(type='Constant', layer='LayerNorm', val=1., bias=0.),
+ dict(type='Constant', layer=['LayerScale'], val=1e-6)
+ ])
diff --git a/configs/_base_/models/hornet/hornet-large.py b/configs/_base_/models/hornet/hornet-large.py
new file mode 100644
index 0000000000000000000000000000000000000000..b5494fd8985970c2a60424ab6b6e07cd8965a6ed
--- /dev/null
+++ b/configs/_base_/models/hornet/hornet-large.py
@@ -0,0 +1,21 @@
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(type='HorNet', arch='large', drop_path_rate=0.2),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=1536,
+ init_cfg=None, # suppress the default init_cfg of LinearClsHead.
+ loss=dict(
+ type='LabelSmoothLoss', label_smooth_val=0.1, mode='original'),
+ cal_acc=False),
+ init_cfg=[
+ dict(type='TruncNormal', layer='Linear', std=0.02, bias=0.),
+ dict(type='Constant', layer='LayerNorm', val=1., bias=0.),
+ dict(type='Constant', layer=['LayerScale'], val=1e-6)
+ ],
+ train_cfg=dict(augments=[
+ dict(type='Mixup', alpha=0.8),
+ dict(type='CutMix', alpha=1.0)
+ ]))
diff --git a/configs/_base_/models/hornet/hornet-small-gf.py b/configs/_base_/models/hornet/hornet-small-gf.py
new file mode 100644
index 0000000000000000000000000000000000000000..42e26d3a4bf75aab77a3fbdda2135bed98223476
--- /dev/null
+++ b/configs/_base_/models/hornet/hornet-small-gf.py
@@ -0,0 +1,21 @@
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(type='HorNet', arch='small-gf', drop_path_rate=0.4),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=768,
+ init_cfg=None, # suppress the default init_cfg of LinearClsHead.
+ loss=dict(
+ type='LabelSmoothLoss', label_smooth_val=0.1, mode='original'),
+ cal_acc=False),
+ init_cfg=[
+ dict(type='TruncNormal', layer='Linear', std=0.02, bias=0.),
+ dict(type='Constant', layer='LayerNorm', val=1., bias=0.),
+ dict(type='Constant', layer=['LayerScale'], val=1e-6)
+ ],
+ train_cfg=dict(augments=[
+ dict(type='Mixup', alpha=0.8),
+ dict(type='CutMix', alpha=1.0)
+ ]))
diff --git a/configs/_base_/models/hornet/hornet-small.py b/configs/_base_/models/hornet/hornet-small.py
new file mode 100644
index 0000000000000000000000000000000000000000..d59184d40ab2f8a5c03c82caeade85dcd32c9180
--- /dev/null
+++ b/configs/_base_/models/hornet/hornet-small.py
@@ -0,0 +1,21 @@
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(type='HorNet', arch='small', drop_path_rate=0.4),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=768,
+ init_cfg=None, # suppress the default init_cfg of LinearClsHead.
+ loss=dict(
+ type='LabelSmoothLoss', label_smooth_val=0.1, mode='original'),
+ cal_acc=False),
+ init_cfg=[
+ dict(type='TruncNormal', layer='Linear', std=0.02, bias=0.),
+ dict(type='Constant', layer='LayerNorm', val=1., bias=0.),
+ dict(type='Constant', layer=['LayerScale'], val=1e-6)
+ ],
+ train_cfg=dict(augments=[
+ dict(type='Mixup', alpha=0.8),
+ dict(type='CutMix', alpha=1.0)
+ ]))
diff --git a/configs/_base_/models/hornet/hornet-tiny-gf.py b/configs/_base_/models/hornet/hornet-tiny-gf.py
new file mode 100644
index 0000000000000000000000000000000000000000..6b06f5b121f18f26c5a3a3442f3bbf8842bdd206
--- /dev/null
+++ b/configs/_base_/models/hornet/hornet-tiny-gf.py
@@ -0,0 +1,21 @@
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(type='HorNet', arch='tiny-gf', drop_path_rate=0.2),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=512,
+ init_cfg=None, # suppress the default init_cfg of LinearClsHead.
+ loss=dict(
+ type='LabelSmoothLoss', label_smooth_val=0.1, mode='original'),
+ cal_acc=False),
+ init_cfg=[
+ dict(type='TruncNormal', layer='Linear', std=0.02, bias=0.),
+ dict(type='Constant', layer='LayerNorm', val=1., bias=0.),
+ dict(type='Constant', layer=['LayerScale'], val=1e-6)
+ ],
+ train_cfg=dict(augments=[
+ dict(type='Mixup', alpha=0.8),
+ dict(type='CutMix', alpha=1.0)
+ ]))
diff --git a/configs/_base_/models/hornet/hornet-tiny.py b/configs/_base_/models/hornet/hornet-tiny.py
new file mode 100644
index 0000000000000000000000000000000000000000..aed710eb862467da4d39c13a4fad41e7e6b76f29
--- /dev/null
+++ b/configs/_base_/models/hornet/hornet-tiny.py
@@ -0,0 +1,21 @@
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(type='HorNet', arch='tiny', drop_path_rate=0.2),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=512,
+ init_cfg=None, # suppress the default init_cfg of LinearClsHead.
+ loss=dict(
+ type='LabelSmoothLoss', label_smooth_val=0.1, mode='original'),
+ cal_acc=False),
+ init_cfg=[
+ dict(type='TruncNormal', layer='Linear', std=0.02, bias=0.),
+ dict(type='Constant', layer='LayerNorm', val=1., bias=0.),
+ dict(type='Constant', layer=['LayerScale'], val=1e-6)
+ ],
+ train_cfg=dict(augments=[
+ dict(type='Mixup', alpha=0.8),
+ dict(type='CutMix', alpha=1.0)
+ ]))
diff --git a/configs/_base_/models/hrnet/hrnet-w18.py b/configs/_base_/models/hrnet/hrnet-w18.py
new file mode 100644
index 0000000000000000000000000000000000000000..f7fbf298d5b64ba1cefa46a4a5d2823c2fa8cf17
--- /dev/null
+++ b/configs/_base_/models/hrnet/hrnet-w18.py
@@ -0,0 +1,16 @@
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(type='HRNet', arch='w18'),
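+    # HRFuseScales fuses the 4 resolution branches into one 2048-d feature.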
+ neck=[
+ dict(type='HRFuseScales', in_channels=(18, 36, 72, 144)),
+ dict(type='GlobalAveragePooling'),
+ ],
+ head=dict(
+ type='LinearClsHead',
+ in_channels=2048,
+ num_classes=1000,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ topk=(1, 5),
+ ))
diff --git a/configs/_base_/models/hrnet/hrnet-w30.py b/configs/_base_/models/hrnet/hrnet-w30.py
new file mode 100644
index 0000000000000000000000000000000000000000..babcacac59af0ff92802a71f48b249b29a760acb
--- /dev/null
+++ b/configs/_base_/models/hrnet/hrnet-w30.py
@@ -0,0 +1,15 @@
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(type='HRNet', arch='w30'),
+ neck=[
+ dict(type='HRFuseScales', in_channels=(30, 60, 120, 240)),
+ dict(type='GlobalAveragePooling'),
+ ],
+ head=dict(
+ type='LinearClsHead',
+ in_channels=2048,
+ num_classes=1000,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ topk=(1, 5),
+ ))
diff --git a/configs/_base_/models/hrnet/hrnet-w32.py b/configs/_base_/models/hrnet/hrnet-w32.py
new file mode 100644
index 0000000000000000000000000000000000000000..2c1e980048d6bb855b94e0bb3027941d07513c05
--- /dev/null
+++ b/configs/_base_/models/hrnet/hrnet-w32.py
@@ -0,0 +1,15 @@
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(type='HRNet', arch='w32'),
+ neck=[
+ dict(type='HRFuseScales', in_channels=(32, 64, 128, 256)),
+ dict(type='GlobalAveragePooling'),
+ ],
+ head=dict(
+ type='LinearClsHead',
+ in_channels=2048,
+ num_classes=1000,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ topk=(1, 5),
+ ))
diff --git a/configs/_base_/models/hrnet/hrnet-w40.py b/configs/_base_/models/hrnet/hrnet-w40.py
new file mode 100644
index 0000000000000000000000000000000000000000..83f65d864679297b25b39438d49eb491c92c33a1
--- /dev/null
+++ b/configs/_base_/models/hrnet/hrnet-w40.py
@@ -0,0 +1,15 @@
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(type='HRNet', arch='w40'),
+ neck=[
+ dict(type='HRFuseScales', in_channels=(40, 80, 160, 320)),
+ dict(type='GlobalAveragePooling'),
+ ],
+ head=dict(
+ type='LinearClsHead',
+ in_channels=2048,
+ num_classes=1000,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ topk=(1, 5),
+ ))
diff --git a/configs/_base_/models/hrnet/hrnet-w44.py b/configs/_base_/models/hrnet/hrnet-w44.py
new file mode 100644
index 0000000000000000000000000000000000000000..e75dc0f891f6f9dd14ba31b865fd29afd622f4db
--- /dev/null
+++ b/configs/_base_/models/hrnet/hrnet-w44.py
@@ -0,0 +1,15 @@
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(type='HRNet', arch='w44'),
+ neck=[
+ dict(type='HRFuseScales', in_channels=(44, 88, 176, 352)),
+ dict(type='GlobalAveragePooling'),
+ ],
+ head=dict(
+ type='LinearClsHead',
+ in_channels=2048,
+ num_classes=1000,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ topk=(1, 5),
+ ))
diff --git a/configs/_base_/models/hrnet/hrnet-w48.py b/configs/_base_/models/hrnet/hrnet-w48.py
new file mode 100644
index 0000000000000000000000000000000000000000..f0604958481ba2af277e3a0f9515dc1423def6c6
--- /dev/null
+++ b/configs/_base_/models/hrnet/hrnet-w48.py
@@ -0,0 +1,15 @@
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(type='HRNet', arch='w48'),
+ neck=[
+ dict(type='HRFuseScales', in_channels=(48, 96, 192, 384)),
+ dict(type='GlobalAveragePooling'),
+ ],
+ head=dict(
+ type='LinearClsHead',
+ in_channels=2048,
+ num_classes=1000,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ topk=(1, 5),
+ ))
diff --git a/configs/_base_/models/hrnet/hrnet-w64.py b/configs/_base_/models/hrnet/hrnet-w64.py
new file mode 100644
index 0000000000000000000000000000000000000000..844c3fe9413f624dd374ceb1a9c3bbc185a20a3e
--- /dev/null
+++ b/configs/_base_/models/hrnet/hrnet-w64.py
@@ -0,0 +1,15 @@
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(type='HRNet', arch='w64'),
+ neck=[
+ dict(type='HRFuseScales', in_channels=(64, 128, 256, 512)),
+ dict(type='GlobalAveragePooling'),
+ ],
+ head=dict(
+ type='LinearClsHead',
+ in_channels=2048,
+ num_classes=1000,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ topk=(1, 5),
+ ))
diff --git a/configs/_base_/models/inception_v3.py b/configs/_base_/models/inception_v3.py
new file mode 100644
index 0000000000000000000000000000000000000000..3f6a8305efe2ef87cfd0d2676056a07595831c6b
--- /dev/null
+++ b/configs/_base_/models/inception_v3.py
@@ -0,0 +1,10 @@
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(type='InceptionV3', num_classes=1000, aux_logits=False),
+ neck=None,
+ head=dict(
+ type='ClsHead',
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ topk=(1, 5)),
+)
diff --git a/configs/_base_/models/itpn_hivit-base-p16.py b/configs/_base_/models/itpn_hivit-base-p16.py
new file mode 100644
index 0000000000000000000000000000000000000000..834d6fe53b30b3370df0e5aaa08d6786472810a6
--- /dev/null
+++ b/configs/_base_/models/itpn_hivit-base-p16.py
@@ -0,0 +1,34 @@
+# model settings
+model = dict(
+ type='iTPN',
+ backbone=dict(
+ type='iTPNHiViT',
+ arch='base',
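+        # 'pixel': reconstruct raw pixel values (MAE-style pre-training target).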
+ reconstruction_type='pixel',
+ mask_ratio=0.75),
+ neck=dict(
+ type='iTPNPretrainDecoder',
+ num_patches=196,
+ patch_size=16,
+ in_chans=3,
+ embed_dim=512,
+ decoder_embed_dim=512,
+ decoder_depth=6,
+ decoder_num_heads=16,
+ mlp_ratio=4.,
+ reconstruction_type='pixel',
+ # transformer pyramid
+ fpn_dim=256,
+ fpn_depth=2,
+ num_outs=3,
+ ),
+ head=dict(
+ type='MAEPretrainHead',
+ norm_pix=True,
+ patch_size=16,
+ loss=dict(type='PixelReconstructionLoss', criterion='L2')),
+ init_cfg=[
+ dict(type='Xavier', layer='Linear', distribution='uniform'),
+ dict(type='Constant', layer='LayerNorm', val=1.0, bias=0.0)
+ ])
diff --git a/configs/_base_/models/levit-256-p16.py b/configs/_base_/models/levit-256-p16.py
new file mode 100644
index 0000000000000000000000000000000000000000..936305bd254cb0c46f1bd0e8d0698f76b9a765c4
--- /dev/null
+++ b/configs/_base_/models/levit-256-p16.py
@@ -0,0 +1,26 @@
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='LeViT',
+ arch='256',
+ img_size=224,
+ patch_size=16,
+ drop_path_rate=0,
+ attn_ratio=2,
+ mlp_ratio=2,
+ out_indices=(2, )),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='LeViTClsHead',
+ num_classes=1000,
+ in_channels=512,
+ distillation=True,
+ loss=dict(
+ type='LabelSmoothLoss', label_smooth_val=0.1, loss_weight=1.0),
+ topk=(1, 5),
+ ),
+ train_cfg=dict(augments=[
+ dict(type='Mixup', alpha=0.8),
+ dict(type='CutMix', alpha=1.0),
+ ]))
diff --git a/configs/_base_/models/mae_hivit-base-p16.py b/configs/_base_/models/mae_hivit-base-p16.py
new file mode 100644
index 0000000000000000000000000000000000000000..bac073c840120c67e3c97b43bd5b308c62dbbbd9
--- /dev/null
+++ b/configs/_base_/models/mae_hivit-base-p16.py
@@ -0,0 +1,24 @@
+# model settings
+model = dict(
+ type='MAE',
+ backbone=dict(
+ type='MAEHiViT', patch_size=16, arch='base', mask_ratio=0.75),
+ neck=dict(
+ type='MAEPretrainDecoder',
+ patch_size=16,
+ in_chans=3,
+ embed_dim=512,
+ decoder_embed_dim=512,
+ decoder_depth=6,
+ decoder_num_heads=16,
+ mlp_ratio=4.,
+ ),
+ head=dict(
+ type='MAEPretrainHead',
+ norm_pix=True,
+ patch_size=16,
+ loss=dict(type='PixelReconstructionLoss', criterion='L2')),
+ init_cfg=[
+ dict(type='Xavier', layer='Linear', distribution='uniform'),
+ dict(type='Constant', layer='LayerNorm', val=1.0, bias=0.0)
+ ])
diff --git a/configs/_base_/models/mae_vit-base-p16.py b/configs/_base_/models/mae_vit-base-p16.py
new file mode 100644
index 0000000000000000000000000000000000000000..8cde8cb7c775d82941324f1abfa3432727b08a07
--- /dev/null
+++ b/configs/_base_/models/mae_vit-base-p16.py
@@ -0,0 +1,24 @@
+# model settings
+model = dict(
+ type='MAE',
+ backbone=dict(type='MAEViT', arch='b', patch_size=16, mask_ratio=0.75),
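+    # mask_ratio=0.75: 75% of the patches are masked during pre-training.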
+ neck=dict(
+ type='MAEPretrainDecoder',
+ patch_size=16,
+ in_chans=3,
+ embed_dim=768,
+ decoder_embed_dim=512,
+ decoder_depth=8,
+ decoder_num_heads=16,
+ mlp_ratio=4.,
+ ),
+ head=dict(
+ type='MAEPretrainHead',
+ norm_pix=True,
+ patch_size=16,
+ loss=dict(type='PixelReconstructionLoss', criterion='L2')),
+ init_cfg=[
+ dict(type='Xavier', layer='Linear', distribution='uniform'),
+ dict(type='Constant', layer='LayerNorm', val=1.0, bias=0.0)
+ ])
diff --git a/configs/_base_/models/mixmim/mixmim_base.py b/configs/_base_/models/mixmim/mixmim_base.py
new file mode 100644
index 0000000000000000000000000000000000000000..ccde357570d22d3e1147b14ec480fd6b31f6a4cf
--- /dev/null
+++ b/configs/_base_/models/mixmim/mixmim_base.py
@@ -0,0 +1,20 @@
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='MixMIMTransformer', arch='B', drop_rate=0.0, drop_path_rate=0.1),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=1024,
+ init_cfg=None,
+ loss=dict(
+ type='LabelSmoothLoss', label_smooth_val=0.1, mode='original'),
+ cal_acc=False),
+ init_cfg=[
+ dict(type='TruncNormal', layer='Linear', std=0.02, bias=0.),
+ dict(type='Constant', layer='LayerNorm', val=1., bias=0.)
+ ],
+ train_cfg=dict(augments=[
+ dict(type='Mixup', alpha=0.8),
+ dict(type='CutMix', alpha=1.0)
+ ]))
diff --git a/configs/_base_/models/mlp_mixer_base_patch16.py b/configs/_base_/models/mlp_mixer_base_patch16.py
new file mode 100644
index 0000000000000000000000000000000000000000..5ebd17f337bb3d6f14e0a45b40ef6f3342477090
--- /dev/null
+++ b/configs/_base_/models/mlp_mixer_base_patch16.py
@@ -0,0 +1,25 @@
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='MlpMixer',
+ arch='b',
+ img_size=224,
+ patch_size=16,
+ drop_rate=0.1,
+ init_cfg=[
+ dict(
+ type='Kaiming',
+ layer='Conv2d',
+ mode='fan_in',
+ nonlinearity='linear')
+ ]),
+ neck=dict(type='GlobalAveragePooling', dim=1),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=768,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ topk=(1, 5),
+ ),
+)
diff --git a/configs/_base_/models/mlp_mixer_large_patch16.py b/configs/_base_/models/mlp_mixer_large_patch16.py
new file mode 100644
index 0000000000000000000000000000000000000000..ff107139bc9aa202b5b60696761f4167c25b5be3
--- /dev/null
+++ b/configs/_base_/models/mlp_mixer_large_patch16.py
@@ -0,0 +1,25 @@
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='MlpMixer',
+ arch='l',
+ img_size=224,
+ patch_size=16,
+ drop_rate=0.1,
+ init_cfg=[
+ dict(
+ type='Kaiming',
+ layer='Conv2d',
+ mode='fan_in',
+ nonlinearity='linear')
+ ]),
+ neck=dict(type='GlobalAveragePooling', dim=1),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=1024,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ topk=(1, 5),
+ ),
+)
diff --git a/configs/_base_/models/mobilenet_v2_1x.py b/configs/_base_/models/mobilenet_v2_1x.py
new file mode 100644
index 0000000000000000000000000000000000000000..6ebff1eff937a1390f23567c37debd164aeb8c9e
--- /dev/null
+++ b/configs/_base_/models/mobilenet_v2_1x.py
@@ -0,0 +1,12 @@
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(type='MobileNetV2', widen_factor=1.0),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=1280,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ topk=(1, 5),
+ ))
diff --git a/configs/_base_/models/mobilenet_v3/mobilenet_v3_large_imagenet.py b/configs/_base_/models/mobilenet_v3/mobilenet_v3_large_imagenet.py
new file mode 100644
index 0000000000000000000000000000000000000000..5318f50feeb7d0d3f54bd70e6f854d1a74fb0743
--- /dev/null
+++ b/configs/_base_/models/mobilenet_v3/mobilenet_v3_large_imagenet.py
@@ -0,0 +1,17 @@
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(type='MobileNetV3', arch='large'),
+ neck=dict(type='GlobalAveragePooling'),
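+    # Head: 960 -> 1280 hidden FC (HSwish, dropout 0.2) -> 1000 classes.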
+ head=dict(
+ type='StackedLinearClsHead',
+ num_classes=1000,
+ in_channels=960,
+ mid_channels=[1280],
+ dropout_rate=0.2,
+ act_cfg=dict(type='HSwish'),
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ init_cfg=dict(
+ type='Normal', layer='Linear', mean=0., std=0.01, bias=0.),
+ topk=(1, 5)))
diff --git a/configs/_base_/models/mobilenet_v3/mobilenet_v3_small_050_imagenet.py b/configs/_base_/models/mobilenet_v3/mobilenet_v3_small_050_imagenet.py
new file mode 100644
index 0000000000000000000000000000000000000000..6356efcd1bf4beacb200f9bb4a3780963c68a302
--- /dev/null
+++ b/configs/_base_/models/mobilenet_v3/mobilenet_v3_small_050_imagenet.py
@@ -0,0 +1,16 @@
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(type='MobileNetV3', arch='small_050'),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='StackedLinearClsHead',
+ num_classes=1000,
+ in_channels=288,
+ mid_channels=[1024],
+ dropout_rate=0.2,
+ act_cfg=dict(type='HSwish'),
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ init_cfg=dict(
+ type='Normal', layer='Linear', mean=0., std=0.01, bias=0.),
+ topk=(1, 5)))
diff --git a/configs/_base_/models/mobilenet_v3/mobilenet_v3_small_075_imagenet.py b/configs/_base_/models/mobilenet_v3/mobilenet_v3_small_075_imagenet.py
new file mode 100644
index 0000000000000000000000000000000000000000..19391ec26a2b1d86d0707a780e60033db166149c
--- /dev/null
+++ b/configs/_base_/models/mobilenet_v3/mobilenet_v3_small_075_imagenet.py
@@ -0,0 +1,16 @@
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(type='MobileNetV3', arch='small_075'),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='StackedLinearClsHead',
+ num_classes=1000,
+ in_channels=432,
+ mid_channels=[1024],
+ dropout_rate=0.2,
+ act_cfg=dict(type='HSwish'),
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ init_cfg=dict(
+ type='Normal', layer='Linear', mean=0., std=0.01, bias=0.),
+ topk=(1, 5)))
diff --git a/configs/_base_/models/mobilenet_v3/mobilenet_v3_small_cifar.py b/configs/_base_/models/mobilenet_v3/mobilenet_v3_small_cifar.py
new file mode 100644
index 0000000000000000000000000000000000000000..5dbe980c47c83733b94a7cfe5b5ae44b3dd15729
--- /dev/null
+++ b/configs/_base_/models/mobilenet_v3/mobilenet_v3_small_cifar.py
@@ -0,0 +1,13 @@
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(type='MobileNetV3', arch='small'),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='StackedLinearClsHead',
+ num_classes=10,
+ in_channels=576,
+ mid_channels=[1280],
+ act_cfg=dict(type='HSwish'),
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ topk=(1, 5)))
diff --git a/configs/_base_/models/mobilenet_v3/mobilenet_v3_small_imagenet.py b/configs/_base_/models/mobilenet_v3/mobilenet_v3_small_imagenet.py
new file mode 100644
index 0000000000000000000000000000000000000000..af6cc1b8d9dcb5b0ec21b38317950149a8a61a10
--- /dev/null
+++ b/configs/_base_/models/mobilenet_v3/mobilenet_v3_small_imagenet.py
@@ -0,0 +1,16 @@
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(type='MobileNetV3', arch='small'),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='StackedLinearClsHead',
+ num_classes=1000,
+ in_channels=576,
+ mid_channels=[1024],
+ dropout_rate=0.2,
+ act_cfg=dict(type='HSwish'),
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ init_cfg=dict(
+ type='Normal', layer='Linear', mean=0., std=0.01, bias=0.),
+ topk=(1, 5)))
diff --git a/configs/_base_/models/mobileone/mobileone_s0.py b/configs/_base_/models/mobileone/mobileone_s0.py
new file mode 100644
index 0000000000000000000000000000000000000000..39624e5594e5270376a3e08719831f5e84ff234a
--- /dev/null
+++ b/configs/_base_/models/mobileone/mobileone_s0.py
@@ -0,0 +1,19 @@
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='MobileOne',
+ arch='s0',
+ out_indices=(3, ),
+ ),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=1024,
+ loss=dict(
+ type='LabelSmoothLoss',
+ label_smooth_val=0.1,
+ mode='original',
+ ),
+ topk=(1, 5),
+ ))
diff --git a/configs/_base_/models/mobileone/mobileone_s1.py b/configs/_base_/models/mobileone/mobileone_s1.py
new file mode 100644
index 0000000000000000000000000000000000000000..cea7762e4b93d6fde21901dbcdb9593209439a5f
--- /dev/null
+++ b/configs/_base_/models/mobileone/mobileone_s1.py
@@ -0,0 +1,19 @@
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='MobileOne',
+ arch='s1',
+ out_indices=(3, ),
+ ),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=1280,
+ loss=dict(
+ type='LabelSmoothLoss',
+ label_smooth_val=0.1,
+ mode='original',
+ ),
+ topk=(1, 5),
+ ))
diff --git a/configs/_base_/models/mobileone/mobileone_s2.py b/configs/_base_/models/mobileone/mobileone_s2.py
new file mode 100644
index 0000000000000000000000000000000000000000..dfae0e1f1a896830d0fde43fdada9f84c3fd3e30
--- /dev/null
+++ b/configs/_base_/models/mobileone/mobileone_s2.py
@@ -0,0 +1,19 @@
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='MobileOne',
+ arch='s2',
+ out_indices=(3, ),
+ ),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=2048,
+ loss=dict(
+ type='LabelSmoothLoss',
+ label_smooth_val=0.1,
+ mode='original',
+ ),
+ topk=(1, 5),
+ ))
diff --git a/configs/_base_/models/mobileone/mobileone_s3.py b/configs/_base_/models/mobileone/mobileone_s3.py
new file mode 100644
index 0000000000000000000000000000000000000000..813567530413cc4b73a3aef08a8b58dc9fca47e1
--- /dev/null
+++ b/configs/_base_/models/mobileone/mobileone_s3.py
@@ -0,0 +1,19 @@
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='MobileOne',
+ arch='s3',
+ out_indices=(3, ),
+ ),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=2048,
+ loss=dict(
+ type='LabelSmoothLoss',
+ label_smooth_val=0.1,
+ mode='original',
+ ),
+ topk=(1, 5),
+ ))
diff --git a/configs/_base_/models/mobileone/mobileone_s4.py b/configs/_base_/models/mobileone/mobileone_s4.py
new file mode 100644
index 0000000000000000000000000000000000000000..282eec8bcf1ce3adf2bfc3861734f1a5b65ea7bf
--- /dev/null
+++ b/configs/_base_/models/mobileone/mobileone_s4.py
@@ -0,0 +1,19 @@
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='MobileOne',
+ arch='s4',
+ out_indices=(3, ),
+ ),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=2048,
+ loss=dict(
+ type='LabelSmoothLoss',
+ label_smooth_val=0.1,
+ mode='original',
+ ),
+ topk=(1, 5),
+ ))
diff --git a/configs/_base_/models/mobilevit/mobilevit_s.py b/configs/_base_/models/mobilevit/mobilevit_s.py
new file mode 100644
index 0000000000000000000000000000000000000000..f6a4e05d2c8f1fc4f7b6a6b5953ff52cdfc7a2c6
--- /dev/null
+++ b/configs/_base_/models/mobilevit/mobilevit_s.py
@@ -0,0 +1,12 @@
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(type='MobileViT', arch='small'),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=640,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ topk=(1, 5),
+ ))
diff --git a/configs/_base_/models/mobilevit/mobilevit_xs.py b/configs/_base_/models/mobilevit/mobilevit_xs.py
new file mode 100644
index 0000000000000000000000000000000000000000..f8c6ef08eb0876bd70508fe72fd81e45470ffbf8
--- /dev/null
+++ b/configs/_base_/models/mobilevit/mobilevit_xs.py
@@ -0,0 +1,12 @@
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(type='MobileViT', arch='x_small'),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=384,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ topk=(1, 5),
+ ))
diff --git a/configs/_base_/models/mobilevit/mobilevit_xxs.py b/configs/_base_/models/mobilevit/mobilevit_xxs.py
new file mode 100644
index 0000000000000000000000000000000000000000..e1c26e6f3e9f559b2599589b7de690ef45ea5611
--- /dev/null
+++ b/configs/_base_/models/mobilevit/mobilevit_xxs.py
@@ -0,0 +1,12 @@
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(type='MobileViT', arch='xx_small'),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=320,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ topk=(1, 5),
+ ))
diff --git a/configs/_base_/models/mvit/mvitv2-base.py b/configs/_base_/models/mvit/mvitv2-base.py
new file mode 100644
index 0000000000000000000000000000000000000000..0cb6064f627bb9ec8e80295623be6c734d1c03c9
--- /dev/null
+++ b/configs/_base_/models/mvit/mvitv2-base.py
@@ -0,0 +1,19 @@
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(type='MViT', arch='base', drop_path_rate=0.3),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='LinearClsHead',
+ in_channels=768,
+ num_classes=1000,
+ loss=dict(
+ type='LabelSmoothLoss', label_smooth_val=0.1, mode='original'),
+ ),
+ init_cfg=[
+ dict(type='TruncNormal', layer='Linear', std=0.02, bias=0.),
+ dict(type='Constant', layer='LayerNorm', val=1., bias=0.)
+ ],
+ train_cfg=dict(augments=[
+ dict(type='Mixup', alpha=0.8),
+ dict(type='CutMix', alpha=1.0)
+ ]))
diff --git a/configs/_base_/models/mvit/mvitv2-large.py b/configs/_base_/models/mvit/mvitv2-large.py
new file mode 100644
index 0000000000000000000000000000000000000000..2c84424311334030010f4b0651876ee8c3bc57cc
--- /dev/null
+++ b/configs/_base_/models/mvit/mvitv2-large.py
@@ -0,0 +1,23 @@
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='MViT',
+ arch='large',
+ drop_path_rate=0.5,
+ dim_mul_in_attention=False),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='LinearClsHead',
+ in_channels=1152,
+ num_classes=1000,
+ loss=dict(
+ type='LabelSmoothLoss', label_smooth_val=0.1, mode='original'),
+ ),
+ init_cfg=[
+ dict(type='TruncNormal', layer='Linear', std=0.02, bias=0.),
+ dict(type='Constant', layer='LayerNorm', val=1., bias=0.)
+ ],
+ train_cfg=dict(augments=[
+ dict(type='Mixup', alpha=0.8),
+ dict(type='CutMix', alpha=1.0)
+ ]))
diff --git a/configs/_base_/models/mvit/mvitv2-small.py b/configs/_base_/models/mvit/mvitv2-small.py
new file mode 100644
index 0000000000000000000000000000000000000000..df895f2950cbf7aa009c308a86352147e427e309
--- /dev/null
+++ b/configs/_base_/models/mvit/mvitv2-small.py
@@ -0,0 +1,19 @@
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(type='MViT', arch='small', drop_path_rate=0.1),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='LinearClsHead',
+ in_channels=768,
+ num_classes=1000,
+ loss=dict(
+ type='LabelSmoothLoss', label_smooth_val=0.1, mode='original'),
+ ),
+ init_cfg=[
+ dict(type='TruncNormal', layer='Linear', std=0.02, bias=0.),
+ dict(type='Constant', layer='LayerNorm', val=1., bias=0.)
+ ],
+ train_cfg=dict(augments=[
+ dict(type='Mixup', alpha=0.8),
+ dict(type='CutMix', alpha=1.0)
+ ]))
diff --git a/configs/_base_/models/mvit/mvitv2-tiny.py b/configs/_base_/models/mvit/mvitv2-tiny.py
new file mode 100644
index 0000000000000000000000000000000000000000..836f04bfce975487ccb05d38f47150e128313918
--- /dev/null
+++ b/configs/_base_/models/mvit/mvitv2-tiny.py
@@ -0,0 +1,19 @@
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(type='MViT', arch='tiny', drop_path_rate=0.1),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='LinearClsHead',
+ in_channels=768,
+ num_classes=1000,
+ loss=dict(
+ type='LabelSmoothLoss', label_smooth_val=0.1, mode='original'),
+ ),
+ init_cfg=[
+ dict(type='TruncNormal', layer='Linear', std=0.02, bias=0.),
+ dict(type='Constant', layer='LayerNorm', val=1., bias=0.)
+ ],
+ train_cfg=dict(augments=[
+ dict(type='Mixup', alpha=0.8),
+ dict(type='CutMix', alpha=1.0)
+ ]))
diff --git a/configs/_base_/models/poolformer/poolformer_m36.py b/configs/_base_/models/poolformer/poolformer_m36.py
new file mode 100644
index 0000000000000000000000000000000000000000..276a72122b18f0731aded4c7652897d92814d53d
--- /dev/null
+++ b/configs/_base_/models/poolformer/poolformer_m36.py
@@ -0,0 +1,22 @@
+# Model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='PoolFormer',
+ arch='m36',
+ drop_path_rate=0.1,
+ init_cfg=[
+ dict(
+ type='TruncNormal',
+ layer=['Conv2d', 'Linear'],
+ std=.02,
+ bias=0.),
+ dict(type='Constant', layer=['GroupNorm'], val=1., bias=0.),
+ ]),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=768,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ ))
diff --git a/configs/_base_/models/poolformer/poolformer_m48.py b/configs/_base_/models/poolformer/poolformer_m48.py
new file mode 100644
index 0000000000000000000000000000000000000000..8c006acbc0d01caa8ecc66b26a3d7b0e75725dab
--- /dev/null
+++ b/configs/_base_/models/poolformer/poolformer_m48.py
@@ -0,0 +1,22 @@
+# Model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='PoolFormer',
+ arch='m48',
+ drop_path_rate=0.1,
+ init_cfg=[
+ dict(
+ type='TruncNormal',
+ layer=['Conv2d', 'Linear'],
+ std=.02,
+ bias=0.),
+ dict(type='Constant', layer=['GroupNorm'], val=1., bias=0.),
+ ]),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=768,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ ))
diff --git a/configs/_base_/models/poolformer/poolformer_s12.py b/configs/_base_/models/poolformer/poolformer_s12.py
new file mode 100644
index 0000000000000000000000000000000000000000..b7b3600f35813acc633845050b1280873ac7ee47
--- /dev/null
+++ b/configs/_base_/models/poolformer/poolformer_s12.py
@@ -0,0 +1,22 @@
+# Model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='PoolFormer',
+ arch='s12',
+ drop_path_rate=0.1,
+ init_cfg=[
+ dict(
+ type='TruncNormal',
+ layer=['Conv2d', 'Linear'],
+ std=.02,
+ bias=0.),
+ dict(type='Constant', layer=['GroupNorm'], val=1., bias=0.),
+ ]),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=512,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ ))
diff --git a/configs/_base_/models/poolformer/poolformer_s24.py b/configs/_base_/models/poolformer/poolformer_s24.py
new file mode 100644
index 0000000000000000000000000000000000000000..822ab5b309c043569cfff4f124680906e9593a5b
--- /dev/null
+++ b/configs/_base_/models/poolformer/poolformer_s24.py
@@ -0,0 +1,22 @@
+# Model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='PoolFormer',
+ arch='s24',
+ drop_path_rate=0.1,
+ init_cfg=[
+ dict(
+ type='TruncNormal',
+ layer=['Conv2d', 'Linear'],
+ std=.02,
+ bias=0.),
+ dict(type='Constant', layer=['GroupNorm'], val=1., bias=0.),
+ ]),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=512,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ ))
diff --git a/configs/_base_/models/poolformer/poolformer_s36.py b/configs/_base_/models/poolformer/poolformer_s36.py
new file mode 100644
index 0000000000000000000000000000000000000000..489f2223c0dbfe25d02dc804843ff8ce379639d2
--- /dev/null
+++ b/configs/_base_/models/poolformer/poolformer_s36.py
@@ -0,0 +1,22 @@
+# Model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='PoolFormer',
+ arch='s36',
+ drop_path_rate=0.1,
+ init_cfg=[
+ dict(
+ type='TruncNormal',
+ layer=['Conv2d', 'Linear'],
+ std=.02,
+ bias=0.),
+ dict(type='Constant', layer=['GroupNorm'], val=1., bias=0.),
+ ]),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=512,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ ))
diff --git a/configs/_base_/models/regnet/regnetx_1.6gf.py b/configs/_base_/models/regnet/regnetx_1.6gf.py
new file mode 100644
index 0000000000000000000000000000000000000000..b81f0ad25bc5c6ccf1775e580f59b86a851fb950
--- /dev/null
+++ b/configs/_base_/models/regnet/regnetx_1.6gf.py
@@ -0,0 +1,12 @@
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(type='RegNet', arch='regnetx_1.6gf'),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=912,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ topk=(1, 5),
+ ))
diff --git a/configs/_base_/models/regnet/regnetx_12gf.py b/configs/_base_/models/regnet/regnetx_12gf.py
new file mode 100644
index 0000000000000000000000000000000000000000..383d4f87992d3d7cb6b9de35e2a82e371a46b12c
--- /dev/null
+++ b/configs/_base_/models/regnet/regnetx_12gf.py
@@ -0,0 +1,12 @@
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(type='RegNet', arch='regnetx_12gf'),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=2240,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ topk=(1, 5),
+ ))
diff --git a/configs/_base_/models/regnet/regnetx_3.2gf.py b/configs/_base_/models/regnet/regnetx_3.2gf.py
new file mode 100644
index 0000000000000000000000000000000000000000..67d454139586d60c17f5468807f761f7835fd0f7
--- /dev/null
+++ b/configs/_base_/models/regnet/regnetx_3.2gf.py
@@ -0,0 +1,12 @@
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(type='RegNet', arch='regnetx_3.2gf'),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=1008,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ topk=(1, 5),
+ ))
diff --git a/configs/_base_/models/regnet/regnetx_4.0gf.py b/configs/_base_/models/regnet/regnetx_4.0gf.py
new file mode 100644
index 0000000000000000000000000000000000000000..01419c64bd18a5a1f9a0c9606209726b957f24ea
--- /dev/null
+++ b/configs/_base_/models/regnet/regnetx_4.0gf.py
@@ -0,0 +1,12 @@
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(type='RegNet', arch='regnetx_4.0gf'),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=1360,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ topk=(1, 5),
+ ))
diff --git a/configs/_base_/models/regnet/regnetx_400mf.py b/configs/_base_/models/regnet/regnetx_400mf.py
new file mode 100644
index 0000000000000000000000000000000000000000..ef518b9f7df4484c158d24e9522a61e41cca3f15
--- /dev/null
+++ b/configs/_base_/models/regnet/regnetx_400mf.py
@@ -0,0 +1,12 @@
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(type='RegNet', arch='regnetx_400mf'),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=384,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ topk=(1, 5),
+ ))
diff --git a/configs/_base_/models/regnet/regnetx_6.4gf.py b/configs/_base_/models/regnet/regnetx_6.4gf.py
new file mode 100644
index 0000000000000000000000000000000000000000..44e6222af015cd5a93e5feccdb98348f1da3991a
--- /dev/null
+++ b/configs/_base_/models/regnet/regnetx_6.4gf.py
@@ -0,0 +1,12 @@
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(type='RegNet', arch='regnetx_6.4gf'),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=1624,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ topk=(1, 5),
+ ))
diff --git a/configs/_base_/models/regnet/regnetx_8.0gf.py b/configs/_base_/models/regnet/regnetx_8.0gf.py
new file mode 100644
index 0000000000000000000000000000000000000000..29298268d767b45d3d5dcde4dd72663b1c407525
--- /dev/null
+++ b/configs/_base_/models/regnet/regnetx_8.0gf.py
@@ -0,0 +1,12 @@
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(type='RegNet', arch='regnetx_8.0gf'),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=1920,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ topk=(1, 5),
+ ))
diff --git a/configs/_base_/models/regnet/regnetx_800mf.py b/configs/_base_/models/regnet/regnetx_800mf.py
new file mode 100644
index 0000000000000000000000000000000000000000..210f760fe29c104c662123af4cecef143ddc9ec3
--- /dev/null
+++ b/configs/_base_/models/regnet/regnetx_800mf.py
@@ -0,0 +1,12 @@
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(type='RegNet', arch='regnetx_800mf'),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=672,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ topk=(1, 5),
+ ))
diff --git a/configs/_base_/models/replknet-31B_in1k.py b/configs/_base_/models/replknet-31B_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..0cc50959d4bfc4597269de078ecabe5c663963b2
--- /dev/null
+++ b/configs/_base_/models/replknet-31B_in1k.py
@@ -0,0 +1,15 @@
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='RepLKNet',
+ arch='31B',
+ out_indices=(3, ),
+ ),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=1024,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ topk=(1, 5),
+ ))
diff --git a/configs/_base_/models/replknet-31L_in1k.py b/configs/_base_/models/replknet-31L_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..7830fb06f74a1ba2d7d437cc7733f446ecb12872
--- /dev/null
+++ b/configs/_base_/models/replknet-31L_in1k.py
@@ -0,0 +1,15 @@
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='RepLKNet',
+ arch='31L',
+ out_indices=(3, ),
+ ),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=1536,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ topk=(1, 5),
+ ))
diff --git a/configs/_base_/models/replknet-XL_in1k.py b/configs/_base_/models/replknet-XL_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..b63f3459c9914a247e8373e1fba4cbd8b4a5a81a
--- /dev/null
+++ b/configs/_base_/models/replknet-XL_in1k.py
@@ -0,0 +1,15 @@
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='RepLKNet',
+ arch='XL',
+ out_indices=(3, ),
+ ),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=2048,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ topk=(1, 5),
+ ))
diff --git a/configs/_base_/models/repmlp-base_224.py b/configs/_base_/models/repmlp-base_224.py
new file mode 100644
index 0000000000000000000000000000000000000000..7db0077882168d1466fede11243f70837df29395
--- /dev/null
+++ b/configs/_base_/models/repmlp-base_224.py
@@ -0,0 +1,20 @@
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='RepMLPNet',
+ arch='B',
+ img_size=224,
+ out_indices=(3, ),
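+        # Extra 1x1/3x3 conv branches used during training; they are merged
+        # into the FC weights when the model is re-parameterized for deploy.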
+ reparam_conv_kernels=(1, 3),
+ deploy=False),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=768,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ topk=(1, 5),
+ ))
diff --git a/configs/_base_/models/repvgg-A0_in1k.py b/configs/_base_/models/repvgg-A0_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..093ffb7eea9f6af6a17e6fe766ba1f1a6160b28d
--- /dev/null
+++ b/configs/_base_/models/repvgg-A0_in1k.py
@@ -0,0 +1,15 @@
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='RepVGG',
+ arch='A0',
+ out_indices=(3, ),
+ ),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=1280,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ topk=(1, 5),
+ ))
diff --git a/configs/_base_/models/repvgg-B3_lbs-mixup_in1k.py b/configs/_base_/models/repvgg-B3_lbs-mixup_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..d88e687b35df35cd5993d24d929a686bf0af6f8b
--- /dev/null
+++ b/configs/_base_/models/repvgg-B3_lbs-mixup_in1k.py
@@ -0,0 +1,22 @@
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='RepVGG',
+ arch='B3',
+ out_indices=(3, ),
+ ),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=2560,
+ loss=dict(
+ type='LabelSmoothLoss',
+ loss_weight=1.0,
+ label_smooth_val=0.1,
+ mode='classy_vision',
+ num_classes=1000),
+ topk=(1, 5),
+ ),
+ train_cfg=dict(augments=dict(type='Mixup', alpha=0.2)),
+)
diff --git a/configs/_base_/models/res2net101-w26-s4.py b/configs/_base_/models/res2net101-w26-s4.py
new file mode 100644
index 0000000000000000000000000000000000000000..3bf64c508f95f8f3d2eb14afbe85799a49ee69aa
--- /dev/null
+++ b/configs/_base_/models/res2net101-w26-s4.py
@@ -0,0 +1,18 @@
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='Res2Net',
+ depth=101,
+ scales=4,
+ base_width=26,
+ deep_stem=False,
+ avg_down=False,
+ ),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=2048,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ topk=(1, 5),
+ ))
diff --git a/configs/_base_/models/res2net50-w14-s8.py b/configs/_base_/models/res2net50-w14-s8.py
new file mode 100644
index 0000000000000000000000000000000000000000..5875142c34d64f8414929bd43ccf37971bc97df8
--- /dev/null
+++ b/configs/_base_/models/res2net50-w14-s8.py
@@ -0,0 +1,18 @@
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='Res2Net',
+ depth=50,
+ scales=8,
+ base_width=14,
+ deep_stem=False,
+ avg_down=False,
+ ),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=2048,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ topk=(1, 5),
+ ))
diff --git a/configs/_base_/models/res2net50-w26-s4.py b/configs/_base_/models/res2net50-w26-s4.py
new file mode 100644
index 0000000000000000000000000000000000000000..be8fdb585903564a9572b575b48967dd1a12c3f4
--- /dev/null
+++ b/configs/_base_/models/res2net50-w26-s4.py
@@ -0,0 +1,18 @@
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='Res2Net',
+ depth=50,
+ scales=4,
+ base_width=26,
+ deep_stem=False,
+ avg_down=False,
+ ),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=2048,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ topk=(1, 5),
+ ))
diff --git a/configs/_base_/models/res2net50-w26-s6.py b/configs/_base_/models/res2net50-w26-s6.py
new file mode 100644
index 0000000000000000000000000000000000000000..281b136a67e245ee90e94bd1495b449af39118e3
--- /dev/null
+++ b/configs/_base_/models/res2net50-w26-s6.py
@@ -0,0 +1,18 @@
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='Res2Net',
+ depth=50,
+ scales=6,
+ base_width=26,
+ deep_stem=False,
+ avg_down=False,
+ ),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=2048,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ topk=(1, 5),
+ ))
diff --git a/configs/_base_/models/res2net50-w26-s8.py b/configs/_base_/models/res2net50-w26-s8.py
new file mode 100644
index 0000000000000000000000000000000000000000..b4f62f3ed19e4ba1f833a23cb5c8d434456b5b07
--- /dev/null
+++ b/configs/_base_/models/res2net50-w26-s8.py
@@ -0,0 +1,18 @@
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='Res2Net',
+ depth=50,
+ scales=8,
+ base_width=26,
+ deep_stem=False,
+ avg_down=False,
+ ),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=2048,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ topk=(1, 5),
+ ))
diff --git a/configs/_base_/models/res2net50-w48-s2.py b/configs/_base_/models/res2net50-w48-s2.py
new file mode 100644
index 0000000000000000000000000000000000000000..8675c91fa008f72ddcaa10f11b91e1f6feb79953
--- /dev/null
+++ b/configs/_base_/models/res2net50-w48-s2.py
@@ -0,0 +1,18 @@
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='Res2Net',
+ depth=50,
+ scales=2,
+ base_width=48,
+ deep_stem=False,
+ avg_down=False,
+ ),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=2048,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ topk=(1, 5),
+ ))
diff --git a/configs/_base_/models/resnest101.py b/configs/_base_/models/resnest101.py
new file mode 100644
index 0000000000000000000000000000000000000000..3780c1549359ec1850ce1db546d23a667e699d4f
--- /dev/null
+++ b/configs/_base_/models/resnest101.py
@@ -0,0 +1,25 @@
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='ResNeSt',
+ depth=101,
+ num_stages=4,
+ stem_channels=128,
+ out_indices=(3, ),
+ style='pytorch'),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=2048,
+ loss=dict(
+ type='LabelSmoothLoss',
+ label_smooth_val=0.1,
+ num_classes=1000,
+ reduction='mean',
+ loss_weight=1.0),
+ topk=(1, 5),
+ cal_acc=False),
+ train_cfg=dict(augments=dict(type='Mixup', alpha=0.2)),
+)
diff --git a/configs/_base_/models/resnest200.py b/configs/_base_/models/resnest200.py
new file mode 100644
index 0000000000000000000000000000000000000000..40d8f03e7f528f8c0132bd2c19515460fd47fe70
--- /dev/null
+++ b/configs/_base_/models/resnest200.py
@@ -0,0 +1,25 @@
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='ResNeSt',
+ depth=200,
+ num_stages=4,
+ stem_channels=128,
+ out_indices=(3, ),
+ style='pytorch'),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=2048,
+ loss=dict(
+ type='LabelSmoothLoss',
+ label_smooth_val=0.1,
+ num_classes=1000,
+ reduction='mean',
+ loss_weight=1.0),
+ topk=(1, 5),
+ cal_acc=False),
+ train_cfg=dict(augments=dict(type='Mixup', alpha=0.2)),
+)
diff --git a/configs/_base_/models/resnest269.py b/configs/_base_/models/resnest269.py
new file mode 100644
index 0000000000000000000000000000000000000000..c37626f5678630383693d784d2590f27caa11de2
--- /dev/null
+++ b/configs/_base_/models/resnest269.py
@@ -0,0 +1,25 @@
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='ResNeSt',
+ depth=269,
+ num_stages=4,
+ stem_channels=128,
+ out_indices=(3, ),
+ style='pytorch'),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=2048,
+ loss=dict(
+ type='LabelSmoothLoss',
+ label_smooth_val=0.1,
+ num_classes=1000,
+ reduction='mean',
+ loss_weight=1.0),
+ topk=(1, 5),
+ cal_acc=False),
+ train_cfg=dict(augments=dict(type='Mixup', alpha=0.2)),
+)
diff --git a/configs/_base_/models/resnest50.py b/configs/_base_/models/resnest50.py
new file mode 100644
index 0000000000000000000000000000000000000000..51c90e86f468edccc3de3b0e7cd783548d220db4
--- /dev/null
+++ b/configs/_base_/models/resnest50.py
@@ -0,0 +1,24 @@
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='ResNeSt',
+ depth=50,
+ num_stages=4,
+ out_indices=(3, ),
+ style='pytorch'),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=2048,
+ loss=dict(
+ type='LabelSmoothLoss',
+ label_smooth_val=0.1,
+ num_classes=1000,
+ reduction='mean',
+ loss_weight=1.0),
+ topk=(1, 5),
+ cal_acc=False),
+ train_cfg=dict(augments=dict(type='Mixup', alpha=0.2)),
+)
diff --git a/configs/_base_/models/resnet101.py b/configs/_base_/models/resnet101.py
new file mode 100644
index 0000000000000000000000000000000000000000..1147cd4be9aff00ad6ce66c31e2839c1a94f9ca3
--- /dev/null
+++ b/configs/_base_/models/resnet101.py
@@ -0,0 +1,17 @@
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='ResNet',
+ depth=101,
+ num_stages=4,
+ out_indices=(3, ),
+ style='pytorch'),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=2048,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ topk=(1, 5),
+ ))
diff --git a/configs/_base_/models/resnet101_cifar.py b/configs/_base_/models/resnet101_cifar.py
new file mode 100644
index 0000000000000000000000000000000000000000..a84d470e3a9828532e5cddcb1a3f7aa4fcae9f68
--- /dev/null
+++ b/configs/_base_/models/resnet101_cifar.py
@@ -0,0 +1,16 @@
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='ResNet_CIFAR',
+ depth=101,
+ num_stages=4,
+ out_indices=(3, ),
+ style='pytorch'),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=10,
+ in_channels=2048,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ ))
diff --git a/configs/_base_/models/resnet152.py b/configs/_base_/models/resnet152.py
new file mode 100644
index 0000000000000000000000000000000000000000..94a718c3cec213727a7a2f11baeb3594fd37532e
--- /dev/null
+++ b/configs/_base_/models/resnet152.py
@@ -0,0 +1,17 @@
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='ResNet',
+ depth=152,
+ num_stages=4,
+ out_indices=(3, ),
+ style='pytorch'),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=2048,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ topk=(1, 5),
+ ))
diff --git a/configs/_base_/models/resnet152_cifar.py b/configs/_base_/models/resnet152_cifar.py
new file mode 100644
index 0000000000000000000000000000000000000000..55c0cc6c66dbde26bebe6d99d791c3e3f28e4e27
--- /dev/null
+++ b/configs/_base_/models/resnet152_cifar.py
@@ -0,0 +1,16 @@
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='ResNet_CIFAR',
+ depth=152,
+ num_stages=4,
+ out_indices=(3, ),
+ style='pytorch'),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=10,
+ in_channels=2048,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ ))
diff --git a/configs/_base_/models/resnet18.py b/configs/_base_/models/resnet18.py
new file mode 100644
index 0000000000000000000000000000000000000000..7c66758ee4aadced38c815e98af68b74aa310a2e
--- /dev/null
+++ b/configs/_base_/models/resnet18.py
@@ -0,0 +1,17 @@
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='ResNet',
+ depth=18,
+ num_stages=4,
+ out_indices=(3, ),
+ style='pytorch'),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=512,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ topk=(1, 5),
+ ))
diff --git a/configs/_base_/models/resnet18_cifar.py b/configs/_base_/models/resnet18_cifar.py
new file mode 100644
index 0000000000000000000000000000000000000000..7b9cf1e7337de73aa21515547b6c3d16e2b178ea
--- /dev/null
+++ b/configs/_base_/models/resnet18_cifar.py
@@ -0,0 +1,16 @@
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='ResNet_CIFAR',
+ depth=18,
+ num_stages=4,
+ out_indices=(3, ),
+ style='pytorch'),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=10,
+ in_channels=512,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ ))
diff --git a/configs/_base_/models/resnet34.py b/configs/_base_/models/resnet34.py
new file mode 100644
index 0000000000000000000000000000000000000000..100ee286bead6b5dd88f1752660e8ab9d0498e37
--- /dev/null
+++ b/configs/_base_/models/resnet34.py
@@ -0,0 +1,17 @@
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='ResNet',
+ depth=34,
+ num_stages=4,
+ out_indices=(3, ),
+ style='pytorch'),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=512,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ topk=(1, 5),
+ ))
diff --git a/configs/_base_/models/resnet34_cifar.py b/configs/_base_/models/resnet34_cifar.py
new file mode 100644
index 0000000000000000000000000000000000000000..55d033bc30bcbde7aef8e57ad950f59c248ad74b
--- /dev/null
+++ b/configs/_base_/models/resnet34_cifar.py
@@ -0,0 +1,16 @@
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='ResNet_CIFAR',
+ depth=34,
+ num_stages=4,
+ out_indices=(3, ),
+ style='pytorch'),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=10,
+ in_channels=512,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ ))
diff --git a/configs/_base_/models/resnet34_gem.py b/configs/_base_/models/resnet34_gem.py
new file mode 100644
index 0000000000000000000000000000000000000000..5c0e0d3e8dc5d7a0b259f1624ee2402af8a401cd
--- /dev/null
+++ b/configs/_base_/models/resnet34_gem.py
@@ -0,0 +1,18 @@
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='ResNet',
+ depth=34,
+ num_stages=4,
+ out_indices=(3, ),
+ style='pytorch'),
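+    # GeM: generalized-mean pooling with a learnable exponent.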
+ neck=dict(type='GeneralizedMeanPooling'),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=512,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ topk=(1, 5),
+ ))
diff --git a/configs/_base_/models/resnet50.py b/configs/_base_/models/resnet50.py
new file mode 100644
index 0000000000000000000000000000000000000000..129a2bb50c91f3034997d216f3a9efb743d9cc40
--- /dev/null
+++ b/configs/_base_/models/resnet50.py
@@ -0,0 +1,17 @@
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='ResNet',
+ depth=50,
+ num_stages=4,
+ out_indices=(3, ),
+ style='pytorch'),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=2048,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ topk=(1, 5),
+ ))
diff --git a/configs/_base_/models/resnet50_cifar.py b/configs/_base_/models/resnet50_cifar.py
new file mode 100644
index 0000000000000000000000000000000000000000..33b66d526482245237faa2862d376797c21a8ee4
--- /dev/null
+++ b/configs/_base_/models/resnet50_cifar.py
@@ -0,0 +1,16 @@
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='ResNet_CIFAR',
+ depth=50,
+ num_stages=4,
+ out_indices=(3, ),
+ style='pytorch'),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=10,
+ in_channels=2048,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ ))
diff --git a/configs/_base_/models/resnet50_cifar_cutmix.py b/configs/_base_/models/resnet50_cifar_cutmix.py
new file mode 100644
index 0000000000000000000000000000000000000000..73c38be271a90b1655ae63e4f36cf6c3a3c5fdc4
--- /dev/null
+++ b/configs/_base_/models/resnet50_cifar_cutmix.py
@@ -0,0 +1,18 @@
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='ResNet_CIFAR',
+ depth=50,
+ num_stages=4,
+ out_indices=(3, ),
+ style='pytorch'),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='MultiLabelLinearClsHead',
+ num_classes=10,
+ in_channels=2048,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0, use_soft=True)),
+ train_cfg=dict(
+ augments=dict(type='BatchCutMix', alpha=1.0, num_classes=10,
+ prob=1.0)))
diff --git a/configs/_base_/models/resnet50_cifar_mixup.py b/configs/_base_/models/resnet50_cifar_mixup.py
new file mode 100644
index 0000000000000000000000000000000000000000..f165c2466bd8a67cbfadd5f3a388d4fe03e6d446
--- /dev/null
+++ b/configs/_base_/models/resnet50_cifar_mixup.py
@@ -0,0 +1,17 @@
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='ResNet_CIFAR',
+ depth=50,
+ num_stages=4,
+ out_indices=(3, ),
+ style='pytorch'),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='MultiLabelLinearClsHead',
+ num_classes=10,
+ in_channels=2048,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0, use_soft=True)),
+ train_cfg=dict(augments=dict(type='Mixup', alpha=1.)),
+)
diff --git a/configs/_base_/models/resnet50_cutmix.py b/configs/_base_/models/resnet50_cutmix.py
new file mode 100644
index 0000000000000000000000000000000000000000..fb79088b798d1c16eb6c336006143c2fe288e6a2
--- /dev/null
+++ b/configs/_base_/models/resnet50_cutmix.py
@@ -0,0 +1,18 @@
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='ResNet',
+ depth=50,
+ num_stages=4,
+ out_indices=(3, ),
+ style='pytorch'),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='MultiLabelLinearClsHead',
+ num_classes=1000,
+ in_channels=2048,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0, use_soft=True)),
+ train_cfg=dict(
+ augments=dict(
+ type='BatchCutMix', alpha=1.0, num_classes=1000, prob=1.0)))
diff --git a/configs/_base_/models/resnet50_label_smooth.py b/configs/_base_/models/resnet50_label_smooth.py
new file mode 100644
index 0000000000000000000000000000000000000000..b6f793751904658b3e7e01a5ffdaa6b86e156e66
--- /dev/null
+++ b/configs/_base_/models/resnet50_label_smooth.py
@@ -0,0 +1,18 @@
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='ResNet',
+ depth=50,
+ num_stages=4,
+ out_indices=(3, ),
+ style='pytorch'),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=2048,
+ loss=dict(
+ type='LabelSmoothLoss', label_smooth_val=0.1, loss_weight=1.0),
+ topk=(1, 5),
+ ))
diff --git a/configs/_base_/models/resnet50_mixup.py b/configs/_base_/models/resnet50_mixup.py
new file mode 100644
index 0000000000000000000000000000000000000000..23130a69c98823a6979dcd7ee7441746753a9865
--- /dev/null
+++ b/configs/_base_/models/resnet50_mixup.py
@@ -0,0 +1,17 @@
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='ResNet',
+ depth=50,
+ num_stages=4,
+ out_indices=(3, ),
+ style='pytorch'),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='MultiLabelLinearClsHead',
+ num_classes=1000,
+ in_channels=2048,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0, use_soft=True)),
+ train_cfg=dict(augments=dict(type='Mixup', alpha=0.2)),
+)
diff --git a/configs/_base_/models/resnetv1c50.py b/configs/_base_/models/resnetv1c50.py
new file mode 100644
index 0000000000000000000000000000000000000000..3b973e20181cd3cf1c470db84abf97aeaa0549c1
--- /dev/null
+++ b/configs/_base_/models/resnetv1c50.py
@@ -0,0 +1,17 @@
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='ResNetV1c',
+ depth=50,
+ num_stages=4,
+ out_indices=(3, ),
+ style='pytorch'),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=2048,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ topk=(1, 5),
+ ))
diff --git a/configs/_base_/models/resnetv1d101.py b/configs/_base_/models/resnetv1d101.py
new file mode 100644
index 0000000000000000000000000000000000000000..1e56223121fb22ac089800ebeb69310758d0f2e7
--- /dev/null
+++ b/configs/_base_/models/resnetv1d101.py
@@ -0,0 +1,17 @@
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='ResNetV1d',
+ depth=101,
+ num_stages=4,
+ out_indices=(3, ),
+ style='pytorch'),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=2048,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ topk=(1, 5),
+ ))
diff --git a/configs/_base_/models/resnetv1d152.py b/configs/_base_/models/resnetv1d152.py
new file mode 100644
index 0000000000000000000000000000000000000000..58cc73beb318e38f9ce79154a1265be1a7dba17b
--- /dev/null
+++ b/configs/_base_/models/resnetv1d152.py
@@ -0,0 +1,17 @@
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='ResNetV1d',
+ depth=152,
+ num_stages=4,
+ out_indices=(3, ),
+ style='pytorch'),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=2048,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ topk=(1, 5),
+ ))
diff --git a/configs/_base_/models/resnetv1d50.py b/configs/_base_/models/resnetv1d50.py
new file mode 100644
index 0000000000000000000000000000000000000000..015aaa3d8182cae50f392d7103e24e8ac8a188aa
--- /dev/null
+++ b/configs/_base_/models/resnetv1d50.py
@@ -0,0 +1,17 @@
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='ResNetV1d',
+ depth=50,
+ num_stages=4,
+ out_indices=(3, ),
+ style='pytorch'),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=2048,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ topk=(1, 5),
+ ))
diff --git a/configs/_base_/models/resnext101_32x4d.py b/configs/_base_/models/resnext101_32x4d.py
new file mode 100644
index 0000000000000000000000000000000000000000..1c89fb6488701c83f12e623ae606abbe3b78799f
--- /dev/null
+++ b/configs/_base_/models/resnext101_32x4d.py
@@ -0,0 +1,19 @@
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='ResNeXt',
+ depth=101,
+ num_stages=4,
+ out_indices=(3, ),
+ groups=32,
+ width_per_group=4,
+ style='pytorch'),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=2048,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ topk=(1, 5),
+ ))
diff --git a/configs/_base_/models/resnext101_32x8d.py b/configs/_base_/models/resnext101_32x8d.py
new file mode 100644
index 0000000000000000000000000000000000000000..2bb63f3aeb8b37eb701135ed1c6bf2d15869fae3
--- /dev/null
+++ b/configs/_base_/models/resnext101_32x8d.py
@@ -0,0 +1,19 @@
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='ResNeXt',
+ depth=101,
+ num_stages=4,
+ out_indices=(3, ),
+ groups=32,
+ width_per_group=8,
+ style='pytorch'),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=2048,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ topk=(1, 5),
+ ))
diff --git a/configs/_base_/models/resnext152_32x4d.py b/configs/_base_/models/resnext152_32x4d.py
new file mode 100644
index 0000000000000000000000000000000000000000..d392eff3dc673b0b74ed013c030152a0107799a2
--- /dev/null
+++ b/configs/_base_/models/resnext152_32x4d.py
@@ -0,0 +1,19 @@
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='ResNeXt',
+ depth=152,
+ num_stages=4,
+ out_indices=(3, ),
+ groups=32,
+ width_per_group=4,
+ style='pytorch'),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=2048,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ topk=(1, 5),
+ ))
diff --git a/configs/_base_/models/resnext50_32x4d.py b/configs/_base_/models/resnext50_32x4d.py
new file mode 100644
index 0000000000000000000000000000000000000000..060426231e8cd845fda17ea053478cf7f57b940a
--- /dev/null
+++ b/configs/_base_/models/resnext50_32x4d.py
@@ -0,0 +1,19 @@
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='ResNeXt',
+ depth=50,
+ num_stages=4,
+ out_indices=(3, ),
+ groups=32,
+ width_per_group=4,
+ style='pytorch'),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=2048,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ topk=(1, 5),
+ ))
diff --git a/configs/_base_/models/revvit/revvit-base.py b/configs/_base_/models/revvit/revvit-base.py
new file mode 100644
index 0000000000000000000000000000000000000000..85b7af42ea7fd6856fd81bc99ee829fb40bce435
--- /dev/null
+++ b/configs/_base_/models/revvit/revvit-base.py
@@ -0,0 +1,27 @@
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='RevVisionTransformer',
+ arch='deit-base',
+ img_size=224,
+ patch_size=16,
+ out_type='avg_featmap',
+ ),
+ neck=None,
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=1536,
+ loss=dict(
+ type='LabelSmoothLoss', label_smooth_val=0.1, mode='original'),
+ ),
+ init_cfg=[
+ dict(type='TruncNormal', layer='Linear', std=.02),
+ dict(type='Constant', layer='LayerNorm', val=1., bias=0.),
+ ],
+ train_cfg=dict(augments=[
+ dict(type='Mixup', alpha=0.8),
+ dict(type='CutMix', alpha=1.0)
+ ]),
+)
diff --git a/configs/_base_/models/revvit/revvit-small.py b/configs/_base_/models/revvit/revvit-small.py
new file mode 100644
index 0000000000000000000000000000000000000000..dd1a0b2661ac2cf54554c06bd729477b94dad908
--- /dev/null
+++ b/configs/_base_/models/revvit/revvit-small.py
@@ -0,0 +1,27 @@
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='RevVisionTransformer',
+ arch='deit-small',
+ img_size=224,
+ patch_size=16,
+ out_type='avg_featmap',
+ ),
+ neck=None,
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=768,
+ loss=dict(
+ type='LabelSmoothLoss', label_smooth_val=0.1, mode='original'),
+ ),
+ init_cfg=[
+ dict(type='TruncNormal', layer='Linear', std=.02),
+ dict(type='Constant', layer='LayerNorm', val=1., bias=0.),
+ ],
+ train_cfg=dict(augments=[
+ dict(type='Mixup', alpha=0.8),
+ dict(type='CutMix', alpha=1.0)
+ ]),
+)
diff --git a/configs/_base_/models/seresnet101.py b/configs/_base_/models/seresnet101.py
new file mode 100644
index 0000000000000000000000000000000000000000..137a6f90f6bca160a073877fc43ea6398fa1d0b4
--- /dev/null
+++ b/configs/_base_/models/seresnet101.py
@@ -0,0 +1,17 @@
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='SEResNet',
+ depth=101,
+ num_stages=4,
+ out_indices=(3, ),
+ style='pytorch'),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=2048,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ topk=(1, 5),
+ ))
diff --git a/configs/_base_/models/seresnet50.py b/configs/_base_/models/seresnet50.py
new file mode 100644
index 0000000000000000000000000000000000000000..e5f6bfce8db9ed75936229bf57992a0211a95b7d
--- /dev/null
+++ b/configs/_base_/models/seresnet50.py
@@ -0,0 +1,17 @@
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='SEResNet',
+ depth=50,
+ num_stages=4,
+ out_indices=(3, ),
+ style='pytorch'),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=2048,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ topk=(1, 5),
+ ))
diff --git a/configs/_base_/models/seresnext101_32x4d.py b/configs/_base_/models/seresnext101_32x4d.py
new file mode 100644
index 0000000000000000000000000000000000000000..cc8a62c39305993bf9b717edf980a1546de12a2b
--- /dev/null
+++ b/configs/_base_/models/seresnext101_32x4d.py
@@ -0,0 +1,20 @@
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='SEResNeXt',
+ depth=101,
+ num_stages=4,
+ out_indices=(3, ),
+ groups=32,
+ width_per_group=4,
+ se_ratio=16,
+ style='pytorch'),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=2048,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ topk=(1, 5),
+ ))
diff --git a/configs/_base_/models/seresnext50_32x4d.py b/configs/_base_/models/seresnext50_32x4d.py
new file mode 100644
index 0000000000000000000000000000000000000000..0cdf7cb696be22d3a5fa5829162052c8b9b7e7a8
--- /dev/null
+++ b/configs/_base_/models/seresnext50_32x4d.py
@@ -0,0 +1,20 @@
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='SEResNeXt',
+ depth=50,
+ num_stages=4,
+ out_indices=(3, ),
+ groups=32,
+ width_per_group=4,
+ se_ratio=16,
+ style='pytorch'),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=2048,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ topk=(1, 5),
+ ))
diff --git a/configs/_base_/models/shufflenet_v1_1x.py b/configs/_base_/models/shufflenet_v1_1x.py
new file mode 100644
index 0000000000000000000000000000000000000000..f0f9d1fbdde759e6c13d9a02705072b3f11faf02
--- /dev/null
+++ b/configs/_base_/models/shufflenet_v1_1x.py
@@ -0,0 +1,12 @@
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(type='ShuffleNetV1', groups=3),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=960,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ topk=(1, 5),
+ ))
diff --git a/configs/_base_/models/shufflenet_v2_1x.py b/configs/_base_/models/shufflenet_v2_1x.py
new file mode 100644
index 0000000000000000000000000000000000000000..190800e343d75a89ffb67a1f7dd33db04d26429d
--- /dev/null
+++ b/configs/_base_/models/shufflenet_v2_1x.py
@@ -0,0 +1,12 @@
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(type='ShuffleNetV2', widen_factor=1.0),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=1024,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ topk=(1, 5),
+ ))
diff --git a/configs/_base_/models/swin_transformer/base_224.py b/configs/_base_/models/swin_transformer/base_224.py
new file mode 100644
index 0000000000000000000000000000000000000000..b7c277f2d6494a6d069bcf053349d8c5df2a0bc3
--- /dev/null
+++ b/configs/_base_/models/swin_transformer/base_224.py
@@ -0,0 +1,23 @@
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='SwinTransformer', arch='base', img_size=224, drop_path_rate=0.5),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=1024,
+ init_cfg=None, # suppress the default init_cfg of LinearClsHead.
+ loss=dict(
+ type='LabelSmoothLoss', label_smooth_val=0.1, mode='original'),
+ cal_acc=False),
+ init_cfg=[
+ dict(type='TruncNormal', layer='Linear', std=0.02, bias=0.),
+ dict(type='Constant', layer='LayerNorm', val=1., bias=0.)
+ ],
+ train_cfg=dict(augments=[
+ dict(type='Mixup', alpha=0.8),
+ dict(type='CutMix', alpha=1.0)
+ ]),
+)
diff --git a/configs/_base_/models/swin_transformer/base_384.py b/configs/_base_/models/swin_transformer/base_384.py
new file mode 100644
index 0000000000000000000000000000000000000000..ce78981fb0775bdb4048522f32e25c58e2159160
--- /dev/null
+++ b/configs/_base_/models/swin_transformer/base_384.py
@@ -0,0 +1,16 @@
+# model settings
+# Only for evaluation
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='SwinTransformer',
+ arch='base',
+ img_size=384,
+ stage_cfgs=dict(block_cfgs=dict(window_size=12))),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=1024,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ topk=(1, 5)))
diff --git a/configs/_base_/models/swin_transformer/large_224.py b/configs/_base_/models/swin_transformer/large_224.py
new file mode 100644
index 0000000000000000000000000000000000000000..747d00e44d4b81383998d7f18b7ae8668bf41c5f
--- /dev/null
+++ b/configs/_base_/models/swin_transformer/large_224.py
@@ -0,0 +1,12 @@
+# model settings
+# Only for evaluation
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(type='SwinTransformer', arch='large', img_size=224),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=1536,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ topk=(1, 5)))
diff --git a/configs/_base_/models/swin_transformer/large_384.py b/configs/_base_/models/swin_transformer/large_384.py
new file mode 100644
index 0000000000000000000000000000000000000000..7026f81a31de2adc445b8ce45520904205f72cee
--- /dev/null
+++ b/configs/_base_/models/swin_transformer/large_384.py
@@ -0,0 +1,16 @@
+# model settings
+# Only for evaluation
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='SwinTransformer',
+ arch='large',
+ img_size=384,
+ stage_cfgs=dict(block_cfgs=dict(window_size=12))),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=1536,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ topk=(1, 5)))
diff --git a/configs/_base_/models/swin_transformer/small_224.py b/configs/_base_/models/swin_transformer/small_224.py
new file mode 100644
index 0000000000000000000000000000000000000000..d87d9d9af6ce9c80581dc03925ed13b4b36893fc
--- /dev/null
+++ b/configs/_base_/models/swin_transformer/small_224.py
@@ -0,0 +1,24 @@
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='SwinTransformer', arch='small', img_size=224,
+ drop_path_rate=0.3),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=768,
+ init_cfg=None, # suppress the default init_cfg of LinearClsHead.
+ loss=dict(
+ type='LabelSmoothLoss', label_smooth_val=0.1, mode='original'),
+ cal_acc=False),
+ init_cfg=[
+ dict(type='TruncNormal', layer='Linear', std=0.02, bias=0.),
+ dict(type='Constant', layer='LayerNorm', val=1., bias=0.)
+ ],
+ train_cfg=dict(augments=[
+ dict(type='Mixup', alpha=0.8),
+ dict(type='CutMix', alpha=1.0)
+ ]),
+)
diff --git a/configs/_base_/models/swin_transformer/tiny_224.py b/configs/_base_/models/swin_transformer/tiny_224.py
new file mode 100644
index 0000000000000000000000000000000000000000..f1781cf5f84fe9dd8386b29337a9fe4f6d717784
--- /dev/null
+++ b/configs/_base_/models/swin_transformer/tiny_224.py
@@ -0,0 +1,23 @@
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='SwinTransformer', arch='tiny', img_size=224, drop_path_rate=0.2),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=768,
+ init_cfg=None, # suppress the default init_cfg of LinearClsHead.
+ loss=dict(
+ type='LabelSmoothLoss', label_smooth_val=0.1, mode='original'),
+ cal_acc=False),
+ init_cfg=[
+ dict(type='TruncNormal', layer='Linear', std=0.02, bias=0.),
+ dict(type='Constant', layer='LayerNorm', val=1., bias=0.)
+ ],
+ train_cfg=dict(augments=[
+ dict(type='Mixup', alpha=0.8),
+ dict(type='CutMix', alpha=1.0)
+ ]),
+)
diff --git a/configs/_base_/models/swin_transformer/tiny_base_224.py b/configs/_base_/models/swin_transformer/tiny_base_224.py
new file mode 100644
index 0000000000000000000000000000000000000000..e353b8cf0c3e66afee351e269475dfd3b234dd2a
--- /dev/null
+++ b/configs/_base_/models/swin_transformer/tiny_base_224.py
@@ -0,0 +1,23 @@
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='SwinTransformer', arch='base', img_size=224, drop_path_rate=0.5),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=200,
+ in_channels=1024,
+ init_cfg=None, # suppress the default init_cfg of LinearClsHead.
+ loss=dict(
+ type='LabelSmoothLoss', label_smooth_val=0.1, mode='original'),
+ cal_acc=False),
+ init_cfg=[
+ dict(type='TruncNormal', layer='Linear', std=0.02, bias=0.),
+ dict(type='Constant', layer='LayerNorm', val=1., bias=0.)
+ ],
+ train_cfg=dict(augments=[
+ dict(type='Mixup', alpha=0.8),
+ dict(type='CutMix', alpha=1.0)
+ ]),
+)
diff --git a/configs/_base_/models/swin_transformer/tiny_large_224.py b/configs/_base_/models/swin_transformer/tiny_large_224.py
new file mode 100644
index 0000000000000000000000000000000000000000..c9e3f9118a68485691f2445ea9dc46917a3ad2cf
--- /dev/null
+++ b/configs/_base_/models/swin_transformer/tiny_large_224.py
@@ -0,0 +1,12 @@
+# model settings
+# Only for evaluation
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(type='SwinTransformer', arch='large', img_size=224),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=200,
+ in_channels=1536,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ topk=(1, 5)))
diff --git a/configs/_base_/models/swin_transformer_v2/base_256.py b/configs/_base_/models/swin_transformer_v2/base_256.py
new file mode 100644
index 0000000000000000000000000000000000000000..66594db25b17a20a346fcff944f2d37d8ff860f7
--- /dev/null
+++ b/configs/_base_/models/swin_transformer_v2/base_256.py
@@ -0,0 +1,26 @@
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='SwinTransformerV2',
+ arch='base',
+ img_size=256,
+ drop_path_rate=0.5),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=1024,
+ init_cfg=None, # suppress the default init_cfg of LinearClsHead.
+ loss=dict(
+ type='LabelSmoothLoss', label_smooth_val=0.1, mode='original'),
+ cal_acc=False),
+ init_cfg=[
+ dict(type='TruncNormal', layer='Linear', std=0.02, bias=0.),
+ dict(type='Constant', layer='LayerNorm', val=1., bias=0.)
+ ],
+ train_cfg=dict(augments=[
+ dict(type='Mixup', alpha=0.8),
+ dict(type='CutMix', alpha=1.0)
+ ]),
+)
diff --git a/configs/_base_/models/swin_transformer_v2/base_384.py b/configs/_base_/models/swin_transformer_v2/base_384.py
new file mode 100644
index 0000000000000000000000000000000000000000..5fb9aead2e98bba3f9277a02024981a1e22b6046
--- /dev/null
+++ b/configs/_base_/models/swin_transformer_v2/base_384.py
@@ -0,0 +1,17 @@
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='SwinTransformerV2',
+ arch='base',
+ img_size=384,
+ drop_path_rate=0.2),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=1024,
+ init_cfg=None, # suppress the default init_cfg of LinearClsHead.
+ loss=dict(
+ type='LabelSmoothLoss', label_smooth_val=0.1, mode='original'),
+ cal_acc=False))
diff --git a/configs/_base_/models/swin_transformer_v2/large_256.py b/configs/_base_/models/swin_transformer_v2/large_256.py
new file mode 100644
index 0000000000000000000000000000000000000000..fe557c32058be1563ed50696b9f44b95b3bb3bed
--- /dev/null
+++ b/configs/_base_/models/swin_transformer_v2/large_256.py
@@ -0,0 +1,16 @@
+# model settings
+# Only for evaluation
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='SwinTransformerV2',
+ arch='large',
+ img_size=256,
+ drop_path_rate=0.2),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=1536,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ topk=(1, 5)))
diff --git a/configs/_base_/models/swin_transformer_v2/large_384.py b/configs/_base_/models/swin_transformer_v2/large_384.py
new file mode 100644
index 0000000000000000000000000000000000000000..a626c40715d1ea2cb1fb0cda0a249d1df01544dc
--- /dev/null
+++ b/configs/_base_/models/swin_transformer_v2/large_384.py
@@ -0,0 +1,16 @@
+# model settings
+# Only for evaluation
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='SwinTransformerV2',
+ arch='large',
+ img_size=384,
+ drop_path_rate=0.2),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=1536,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ topk=(1, 5)))
diff --git a/configs/_base_/models/swin_transformer_v2/small_256.py b/configs/_base_/models/swin_transformer_v2/small_256.py
new file mode 100644
index 0000000000000000000000000000000000000000..0ec706ff0e16e44027fad3ee54e93280018d76bd
--- /dev/null
+++ b/configs/_base_/models/swin_transformer_v2/small_256.py
@@ -0,0 +1,26 @@
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='SwinTransformerV2',
+ arch='small',
+ img_size=256,
+ drop_path_rate=0.3),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=768,
+ init_cfg=None, # suppress the default init_cfg of LinearClsHead.
+ loss=dict(
+ type='LabelSmoothLoss', label_smooth_val=0.1, mode='original'),
+ cal_acc=False),
+ init_cfg=[
+ dict(type='TruncNormal', layer='Linear', std=0.02, bias=0.),
+ dict(type='Constant', layer='LayerNorm', val=1., bias=0.)
+ ],
+ train_cfg=dict(augments=[
+ dict(type='Mixup', alpha=0.8),
+ dict(type='CutMix', alpha=1.0)
+ ]),
+)
diff --git a/configs/_base_/models/swin_transformer_v2/tiny_256.py b/configs/_base_/models/swin_transformer_v2/tiny_256.py
new file mode 100644
index 0000000000000000000000000000000000000000..61055a1310ab86bea26d427fe445bc4cfe7bf89e
--- /dev/null
+++ b/configs/_base_/models/swin_transformer_v2/tiny_256.py
@@ -0,0 +1,26 @@
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='SwinTransformerV2',
+ arch='tiny',
+ img_size=256,
+ drop_path_rate=0.2),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=768,
+ init_cfg=None, # suppress the default init_cfg of LinearClsHead.
+ loss=dict(
+ type='LabelSmoothLoss', label_smooth_val=0.1, mode='original'),
+ cal_acc=False),
+ init_cfg=[
+ dict(type='TruncNormal', layer='Linear', std=0.02, bias=0.),
+ dict(type='Constant', layer='LayerNorm', val=1., bias=0.)
+ ],
+ train_cfg=dict(augments=[
+ dict(type='Mixup', alpha=0.8),
+ dict(type='CutMix', alpha=1.0)
+ ]),
+)
diff --git a/configs/_base_/models/t2t-vit-t-14.py b/configs/_base_/models/t2t-vit-t-14.py
new file mode 100644
index 0000000000000000000000000000000000000000..58ea660e742b1ef8edf93fb10ac1331734a4dbe5
--- /dev/null
+++ b/configs/_base_/models/t2t-vit-t-14.py
@@ -0,0 +1,42 @@
+# model settings
+embed_dims = 384
+num_classes = 1000
+
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='T2T_ViT',
+ img_size=224,
+ in_channels=3,
+ embed_dims=embed_dims,
+ t2t_cfg=dict(
+ token_dims=64,
+ use_performer=False,
+ ),
+ num_layers=14,
+ layer_cfgs=dict(
+ num_heads=6,
+ feedforward_channels=3 * embed_dims, # mlp_ratio = 3
+ ),
+ drop_path_rate=0.1,
+ init_cfg=[
+ dict(type='TruncNormal', layer='Linear', std=.02),
+ dict(type='Constant', layer='LayerNorm', val=1., bias=0.),
+ ]),
+ neck=None,
+ head=dict(
+ type='VisionTransformerClsHead',
+ num_classes=num_classes,
+ in_channels=embed_dims,
+ loss=dict(
+ type='LabelSmoothLoss',
+ label_smooth_val=0.1,
+ mode='original',
+ ),
+ topk=(1, 5),
+ init_cfg=dict(type='TruncNormal', layer='Linear', std=.02)),
+ train_cfg=dict(augments=[
+ dict(type='Mixup', alpha=0.8),
+ dict(type='CutMix', alpha=1.0),
+ ]),
+)
diff --git a/configs/_base_/models/t2t-vit-t-19.py b/configs/_base_/models/t2t-vit-t-19.py
new file mode 100644
index 0000000000000000000000000000000000000000..51741c7a7cbcfd8f13fb1574f831978a144ca1a4
--- /dev/null
+++ b/configs/_base_/models/t2t-vit-t-19.py
@@ -0,0 +1,42 @@
+# model settings
+embed_dims = 448
+num_classes = 1000
+
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='T2T_ViT',
+ img_size=224,
+ in_channels=3,
+ embed_dims=embed_dims,
+ t2t_cfg=dict(
+ token_dims=64,
+ use_performer=False,
+ ),
+ num_layers=19,
+ layer_cfgs=dict(
+ num_heads=7,
+ feedforward_channels=3 * embed_dims, # mlp_ratio = 3
+ ),
+ drop_path_rate=0.1,
+ init_cfg=[
+ dict(type='TruncNormal', layer='Linear', std=.02),
+ dict(type='Constant', layer='LayerNorm', val=1., bias=0.),
+ ]),
+ neck=None,
+ head=dict(
+ type='VisionTransformerClsHead',
+ num_classes=num_classes,
+ in_channels=embed_dims,
+ loss=dict(
+ type='LabelSmoothLoss',
+ label_smooth_val=0.1,
+ mode='original',
+ ),
+ topk=(1, 5),
+ init_cfg=dict(type='TruncNormal', layer='Linear', std=.02)),
+ train_cfg=dict(augments=[
+ dict(type='Mixup', alpha=0.8),
+ dict(type='CutMix', alpha=1.0),
+ ]),
+)
diff --git a/configs/_base_/models/t2t-vit-t-24.py b/configs/_base_/models/t2t-vit-t-24.py
new file mode 100644
index 0000000000000000000000000000000000000000..ad772cf6e614bbca630ffad75393614415102bb9
--- /dev/null
+++ b/configs/_base_/models/t2t-vit-t-24.py
@@ -0,0 +1,42 @@
+# model settings
+embed_dims = 512
+num_classes = 1000
+
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='T2T_ViT',
+ img_size=224,
+ in_channels=3,
+ embed_dims=embed_dims,
+ t2t_cfg=dict(
+ token_dims=64,
+ use_performer=False,
+ ),
+ num_layers=24,
+ layer_cfgs=dict(
+ num_heads=8,
+ feedforward_channels=3 * embed_dims, # mlp_ratio = 3
+ ),
+ drop_path_rate=0.1,
+ init_cfg=[
+ dict(type='TruncNormal', layer='Linear', std=.02),
+ dict(type='Constant', layer='LayerNorm', val=1., bias=0.),
+ ]),
+ neck=None,
+ head=dict(
+ type='VisionTransformerClsHead',
+ num_classes=num_classes,
+ in_channels=embed_dims,
+ loss=dict(
+ type='LabelSmoothLoss',
+ label_smooth_val=0.1,
+ mode='original',
+ ),
+ topk=(1, 5),
+ init_cfg=dict(type='TruncNormal', layer='Linear', std=.02)),
+ train_cfg=dict(augments=[
+ dict(type='Mixup', alpha=0.8),
+ dict(type='CutMix', alpha=1.0),
+ ]),
+)
diff --git a/configs/_base_/models/tiny-vit-large-p16.py b/configs/_base_/models/tiny-vit-large-p16.py
new file mode 100644
index 0000000000000000000000000000000000000000..8e4e7f656bc73f5b4e66610fd134950afa377ea8
--- /dev/null
+++ b/configs/_base_/models/tiny-vit-large-p16.py
@@ -0,0 +1,24 @@
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='VisionTransformer',
+ arch='l',
+ img_size=224,
+ patch_size=16,
+ drop_rate=0.1,
+ init_cfg=[
+ dict(
+ type='Kaiming',
+ layer='Conv2d',
+ mode='fan_in',
+ nonlinearity='linear')
+ ]),
+ neck=None,
+ head=dict(
+ type='VisionTransformerClsHead',
+ num_classes=200,
+ in_channels=1024,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ topk=(1, 5),
+ ))
diff --git a/configs/_base_/models/tinyvit/tinyvit-11m.py b/configs/_base_/models/tinyvit/tinyvit-11m.py
new file mode 100644
index 0000000000000000000000000000000000000000..6c046e35a0fe11aaa679300d3a2d3be59ff1051b
--- /dev/null
+++ b/configs/_base_/models/tinyvit/tinyvit-11m.py
@@ -0,0 +1,25 @@
+# Model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='TinyViT',
+ arch='11m',
+ img_size=(224, 224),
+ window_size=[7, 7, 14, 7],
+ out_indices=(3, ),
+ drop_path_rate=0.1,
+ gap_before_final_norm=True,
+ init_cfg=[
+ dict(
+ type='TruncNormal',
+ layer=['Conv2d', 'Linear'],
+ std=.02,
+ bias=0.),
+ dict(type='Constant', layer=['LayerNorm'], val=1., bias=0.),
+ ]),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=448,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ ))
diff --git a/configs/_base_/models/tinyvit/tinyvit-21m.py b/configs/_base_/models/tinyvit/tinyvit-21m.py
new file mode 100644
index 0000000000000000000000000000000000000000..7f362f8f62789f6442e33a5a000ce8d9a458a597
--- /dev/null
+++ b/configs/_base_/models/tinyvit/tinyvit-21m.py
@@ -0,0 +1,25 @@
+# Model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='TinyViT',
+ arch='21m',
+ img_size=(224, 224),
+ window_size=[7, 7, 14, 7],
+ out_indices=(3, ),
+ drop_path_rate=0.2,
+ gap_before_final_norm=True,
+ init_cfg=[
+ dict(
+ type='TruncNormal',
+ layer=['Conv2d', 'Linear'],
+ std=.02,
+ bias=0.),
+ dict(type='Constant', layer=['LayerNorm'], val=1., bias=0.),
+ ]),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=576,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ ))
diff --git a/configs/_base_/models/tinyvit/tinyvit-5m.py b/configs/_base_/models/tinyvit/tinyvit-5m.py
new file mode 100644
index 0000000000000000000000000000000000000000..923ebd918f82f40537e0f40f550c3cd264d7e389
--- /dev/null
+++ b/configs/_base_/models/tinyvit/tinyvit-5m.py
@@ -0,0 +1,25 @@
+# Model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='TinyViT',
+ arch='5m',
+ img_size=(224, 224),
+ window_size=[7, 7, 14, 7],
+ out_indices=(3, ),
+ drop_path_rate=0.0,
+ gap_before_final_norm=True,
+ init_cfg=[
+ dict(
+ type='TruncNormal',
+ layer=['Conv2d', 'Linear'],
+ std=.02,
+ bias=0.),
+ dict(type='Constant', layer=['LayerNorm'], val=1., bias=0.),
+ ]),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=320,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ ))
diff --git a/configs/_base_/models/tnt_s_patch16_224.py b/configs/_base_/models/tnt_s_patch16_224.py
new file mode 100644
index 0000000000000000000000000000000000000000..5e13d07828c5d89d0e9ce4fc1a29fe7a6a4875d4
--- /dev/null
+++ b/configs/_base_/models/tnt_s_patch16_224.py
@@ -0,0 +1,29 @@
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='TNT',
+ arch='s',
+ img_size=224,
+ patch_size=16,
+ in_channels=3,
+ ffn_ratio=4,
+ qkv_bias=False,
+ drop_rate=0.,
+ attn_drop_rate=0.,
+ drop_path_rate=0.1,
+ first_stride=4,
+ num_fcs=2,
+ init_cfg=[
+ dict(type='TruncNormal', layer='Linear', std=.02),
+ dict(type='Constant', layer='LayerNorm', val=1., bias=0.)
+ ]),
+ neck=None,
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=384,
+ loss=dict(
+ type='LabelSmoothLoss', label_smooth_val=0.1, mode='original'),
+ topk=(1, 5),
+ init_cfg=dict(type='TruncNormal', layer='Linear', std=.02)))
diff --git a/configs/_base_/models/twins_pcpvt_base.py b/configs/_base_/models/twins_pcpvt_base.py
new file mode 100644
index 0000000000000000000000000000000000000000..14e46baedd273bd3baef163e2966653626170a1c
--- /dev/null
+++ b/configs/_base_/models/twins_pcpvt_base.py
@@ -0,0 +1,31 @@
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='PCPVT',
+ arch='base',
+ in_channels=3,
+ out_indices=(3, ),
+ qkv_bias=True,
+ norm_cfg=dict(type='LN', eps=1e-06),
+ norm_after_stage=[False, False, False, True],
+ drop_rate=0.0,
+ attn_drop_rate=0.,
+ drop_path_rate=0.3),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=512,
+ loss=dict(
+ type='LabelSmoothLoss', label_smooth_val=0.1, mode='original'),
+ cal_acc=False),
+ init_cfg=[
+ dict(type='TruncNormal', layer='Linear', std=0.02, bias=0.),
+ dict(type='Constant', layer='LayerNorm', val=1., bias=0.)
+ ],
+ train_cfg=dict(augments=[
+ dict(type='Mixup', alpha=0.8),
+ dict(type='CutMix', alpha=1.0)
+ ]),
+)
diff --git a/configs/_base_/models/twins_svt_base.py b/configs/_base_/models/twins_svt_base.py
new file mode 100644
index 0000000000000000000000000000000000000000..a37385b018f9b345ebcd3a9aaad575cd98e8b8f3
--- /dev/null
+++ b/configs/_base_/models/twins_svt_base.py
@@ -0,0 +1,31 @@
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='SVT',
+ arch='base',
+ in_channels=3,
+ out_indices=(3, ),
+ qkv_bias=True,
+ norm_cfg=dict(type='LN'),
+ norm_after_stage=[False, False, False, True],
+ drop_rate=0.0,
+ attn_drop_rate=0.,
+ drop_path_rate=0.3),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=768,
+ loss=dict(
+ type='LabelSmoothLoss', label_smooth_val=0.1, mode='original'),
+ cal_acc=False),
+ init_cfg=[
+ dict(type='TruncNormal', layer='Linear', std=0.02, bias=0.),
+ dict(type='Constant', layer='LayerNorm', val=1., bias=0.)
+ ],
+ train_cfg=dict(augments=[
+ dict(type='Mixup', alpha=0.8),
+ dict(type='CutMix', alpha=1.0)
+ ]),
+)
diff --git a/configs/_base_/models/van/van_base.py b/configs/_base_/models/van/van_base.py
new file mode 100644
index 0000000000000000000000000000000000000000..006459255f82f4ad4250ee01f1d9d25605beb5d1
--- /dev/null
+++ b/configs/_base_/models/van/van_base.py
@@ -0,0 +1,13 @@
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(type='VAN', arch='base', drop_path_rate=0.1),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=512,
+ init_cfg=None, # suppress the default init_cfg of LinearClsHead.
+ loss=dict(
+ type='LabelSmoothLoss', label_smooth_val=0.1, mode='original'),
+ cal_acc=False))
diff --git a/configs/_base_/models/van/van_large.py b/configs/_base_/models/van/van_large.py
new file mode 100644
index 0000000000000000000000000000000000000000..4ebafabdaaf7a4b828919e61e980e423385897e6
--- /dev/null
+++ b/configs/_base_/models/van/van_large.py
@@ -0,0 +1,13 @@
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(type='VAN', arch='large', drop_path_rate=0.2),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=512,
+ init_cfg=None, # suppress the default init_cfg of LinearClsHead.
+ loss=dict(
+ type='LabelSmoothLoss', label_smooth_val=0.1, mode='original'),
+ cal_acc=False))
diff --git a/configs/_base_/models/van/van_small.py b/configs/_base_/models/van/van_small.py
new file mode 100644
index 0000000000000000000000000000000000000000..29393c6308af0732f4757d1ef4bd98d7b3cddcf1
--- /dev/null
+++ b/configs/_base_/models/van/van_small.py
@@ -0,0 +1,22 @@
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(type='VAN', arch='small', drop_path_rate=0.1),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=512,
+ init_cfg=None, # suppress the default init_cfg of LinearClsHead.
+ loss=dict(
+ type='LabelSmoothLoss', label_smooth_val=0.1, mode='original'),
+ cal_acc=False),
+ init_cfg=[
+ dict(type='TruncNormal', layer='Linear', std=0.02, bias=0.),
+ dict(type='Constant', layer='LayerNorm', val=1., bias=0.)
+ ],
+ train_cfg=dict(augments=[
+ dict(type='Mixup', alpha=0.8),
+ dict(type='CutMix', alpha=1.0)
+ ]),
+)
diff --git a/configs/_base_/models/van/van_tiny.py b/configs/_base_/models/van/van_tiny.py
new file mode 100644
index 0000000000000000000000000000000000000000..9cf5b28836f9216c642dfdfb62f37f3066a7ad09
--- /dev/null
+++ b/configs/_base_/models/van/van_tiny.py
@@ -0,0 +1,22 @@
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(type='VAN', arch='tiny', drop_path_rate=0.1),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=256,
+ init_cfg=None, # suppress the default init_cfg of LinearClsHead.
+ loss=dict(
+ type='LabelSmoothLoss', label_smooth_val=0.1, mode='original'),
+ cal_acc=False),
+ init_cfg=[
+ dict(type='TruncNormal', layer='Linear', std=0.02, bias=0.),
+ dict(type='Constant', layer='LayerNorm', val=1., bias=0.)
+ ],
+ train_cfg=dict(augments=[
+ dict(type='Mixup', alpha=0.8),
+ dict(type='CutMix', alpha=1.0)
+ ]),
+)
diff --git a/configs/_base_/models/vgg11.py b/configs/_base_/models/vgg11.py
new file mode 100644
index 0000000000000000000000000000000000000000..2b6ee1426aae383b1db5c4451e37caec5eafdcfa
--- /dev/null
+++ b/configs/_base_/models/vgg11.py
@@ -0,0 +1,10 @@
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(type='VGG', depth=11, num_classes=1000),
+ neck=None,
+ head=dict(
+ type='ClsHead',
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ topk=(1, 5),
+ ))
diff --git a/configs/_base_/models/vgg11bn.py b/configs/_base_/models/vgg11bn.py
new file mode 100644
index 0000000000000000000000000000000000000000..cb4c64e95a85367841615fd52af7af50b5b1e9fb
--- /dev/null
+++ b/configs/_base_/models/vgg11bn.py
@@ -0,0 +1,11 @@
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='VGG', depth=11, norm_cfg=dict(type='BN'), num_classes=1000),
+ neck=None,
+ head=dict(
+ type='ClsHead',
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ topk=(1, 5),
+ ))
diff --git a/configs/_base_/models/vgg13.py b/configs/_base_/models/vgg13.py
new file mode 100644
index 0000000000000000000000000000000000000000..a9389100a61514043bbe7426b93cfd257df5cd26
--- /dev/null
+++ b/configs/_base_/models/vgg13.py
@@ -0,0 +1,10 @@
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(type='VGG', depth=13, num_classes=1000),
+ neck=None,
+ head=dict(
+ type='ClsHead',
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ topk=(1, 5),
+ ))
diff --git a/configs/_base_/models/vgg13bn.py b/configs/_base_/models/vgg13bn.py
new file mode 100644
index 0000000000000000000000000000000000000000..b12173b51b80b671fd85c9fa8ececd75881d4bd2
--- /dev/null
+++ b/configs/_base_/models/vgg13bn.py
@@ -0,0 +1,11 @@
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='VGG', depth=13, norm_cfg=dict(type='BN'), num_classes=1000),
+ neck=None,
+ head=dict(
+ type='ClsHead',
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ topk=(1, 5),
+ ))
diff --git a/configs/_base_/models/vgg16.py b/configs/_base_/models/vgg16.py
new file mode 100644
index 0000000000000000000000000000000000000000..93ce864fac29a7c4adf4df12e5653f97ce09d7be
--- /dev/null
+++ b/configs/_base_/models/vgg16.py
@@ -0,0 +1,10 @@
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(type='VGG', depth=16, num_classes=1000),
+ neck=None,
+ head=dict(
+ type='ClsHead',
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ topk=(1, 5),
+ ))
diff --git a/configs/_base_/models/vgg16bn.py b/configs/_base_/models/vgg16bn.py
new file mode 100644
index 0000000000000000000000000000000000000000..765e34f6367bc52e10322692a849d1003d57dfd2
--- /dev/null
+++ b/configs/_base_/models/vgg16bn.py
@@ -0,0 +1,11 @@
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='VGG', depth=16, norm_cfg=dict(type='BN'), num_classes=1000),
+ neck=None,
+ head=dict(
+ type='ClsHead',
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ topk=(1, 5),
+ ))
diff --git a/configs/_base_/models/vgg19.py b/configs/_base_/models/vgg19.py
new file mode 100644
index 0000000000000000000000000000000000000000..6f4ab061b2c7a87d86aaebcf78aaf84abd2bb0cc
--- /dev/null
+++ b/configs/_base_/models/vgg19.py
@@ -0,0 +1,10 @@
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(type='VGG', depth=19, num_classes=1000),
+ neck=None,
+ head=dict(
+ type='ClsHead',
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ topk=(1, 5),
+ ))
diff --git a/configs/_base_/models/vgg19bn.py b/configs/_base_/models/vgg19bn.py
new file mode 100644
index 0000000000000000000000000000000000000000..c468b5dea2cc5503ca2b266c57d163b2308b7dd3
--- /dev/null
+++ b/configs/_base_/models/vgg19bn.py
@@ -0,0 +1,11 @@
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='VGG', depth=19, norm_cfg=dict(type='BN'), num_classes=1000),
+ neck=None,
+ head=dict(
+ type='ClsHead',
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ topk=(1, 5),
+ ))
diff --git a/configs/_base_/models/vig/pyramid_vig_base.py b/configs/_base_/models/vig/pyramid_vig_base.py
new file mode 100644
index 0000000000000000000000000000000000000000..a258457c84aecc2f1cdf29131f60b522526dbdd8
--- /dev/null
+++ b/configs/_base_/models/vig/pyramid_vig_base.py
@@ -0,0 +1,32 @@
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='PyramidVig',
+ arch='base',
+ k=9,
+ act_cfg=dict(type='GELU'),
+ norm_cfg=dict(type='BN'),
+ graph_conv_type='mr',
+ graph_conv_bias=True,
+ epsilon=0.2,
+ use_stochastic=False,
+ drop_path=0.1,
+ norm_eval=False,
+ frozen_stages=0),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='VigClsHead',
+ num_classes=1000,
+ in_channels=1024,
+ hidden_dim=1024,
+ act_cfg=dict(type='GELU'),
+ dropout=0.,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ topk=(1, 5),
+ ),
+ train_cfg=dict(augments=[
+ dict(type='Mixup', alpha=0.8),
+ dict(type='CutMix', alpha=1.0)
+ ]),
+)
diff --git a/configs/_base_/models/vig/pyramid_vig_medium.py b/configs/_base_/models/vig/pyramid_vig_medium.py
new file mode 100644
index 0000000000000000000000000000000000000000..a551aba3e079576e13f5db3a77d5e6622079e497
--- /dev/null
+++ b/configs/_base_/models/vig/pyramid_vig_medium.py
@@ -0,0 +1,32 @@
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='PyramidVig',
+ arch='medium',
+ k=9,
+ act_cfg=dict(type='GELU'),
+ norm_cfg=dict(type='BN'),
+ graph_conv_type='mr',
+ graph_conv_bias=True,
+ epsilon=0.2,
+ use_stochastic=False,
+ drop_path=0.1,
+ norm_eval=False,
+ frozen_stages=0),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='VigClsHead',
+ num_classes=1000,
+ in_channels=768,
+ hidden_dim=1024,
+ act_cfg=dict(type='GELU'),
+ dropout=0.,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ topk=(1, 5),
+ ),
+ train_cfg=dict(augments=[
+ dict(type='Mixup', alpha=0.8),
+ dict(type='CutMix', alpha=1.0)
+ ]),
+)
diff --git a/configs/_base_/models/vig/pyramid_vig_small.py b/configs/_base_/models/vig/pyramid_vig_small.py
new file mode 100644
index 0000000000000000000000000000000000000000..940275e6cf941ce0d6a7f7dc3e4a1b867cf88309
--- /dev/null
+++ b/configs/_base_/models/vig/pyramid_vig_small.py
@@ -0,0 +1,32 @@
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='PyramidVig',
+ arch='small',
+ k=9,
+ act_cfg=dict(type='GELU'),
+ norm_cfg=dict(type='BN'),
+ graph_conv_type='mr',
+ graph_conv_bias=True,
+ epsilon=0.2,
+ use_stochastic=False,
+ drop_path=0.1,
+ norm_eval=False,
+ frozen_stages=0),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='VigClsHead',
+ num_classes=1000,
+ in_channels=640,
+ hidden_dim=1024,
+ act_cfg=dict(type='GELU'),
+ dropout=0.,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ topk=(1, 5),
+ ),
+ train_cfg=dict(augments=[
+ dict(type='Mixup', alpha=0.8),
+ dict(type='CutMix', alpha=1.0)
+ ]),
+)
diff --git a/configs/_base_/models/vig/pyramid_vig_tiny.py b/configs/_base_/models/vig/pyramid_vig_tiny.py
new file mode 100644
index 0000000000000000000000000000000000000000..fea0734fe9ab2e962e51b819c467ad965b88a958
--- /dev/null
+++ b/configs/_base_/models/vig/pyramid_vig_tiny.py
@@ -0,0 +1,32 @@
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='PyramidVig',
+ arch='tiny',
+ k=9,
+ act_cfg=dict(type='GELU'),
+ norm_cfg=dict(type='BN'),
+ graph_conv_type='mr',
+ graph_conv_bias=True,
+ epsilon=0.2,
+ use_stochastic=False,
+ drop_path=0.1,
+ norm_eval=False,
+ frozen_stages=0),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='VigClsHead',
+ num_classes=1000,
+ in_channels=384,
+ hidden_dim=1024,
+ act_cfg=dict(type='GELU'),
+ dropout=0.,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ topk=(1, 5),
+ ),
+ train_cfg=dict(augments=[
+ dict(type='Mixup', alpha=0.8),
+ dict(type='CutMix', alpha=1.0)
+ ]),
+)
diff --git a/configs/_base_/models/vig/vig_base.py b/configs/_base_/models/vig/vig_base.py
new file mode 100644
index 0000000000000000000000000000000000000000..6c5f293ddfab1e8712c90f96aaa62acf62159e65
--- /dev/null
+++ b/configs/_base_/models/vig/vig_base.py
@@ -0,0 +1,33 @@
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='Vig',
+ arch='base',
+ k=9,
+ act_cfg=dict(type='GELU'),
+ norm_cfg=dict(type='BN'),
+ graph_conv_type='mr',
+ graph_conv_bias=True,
+ epsilon=0.2,
+ use_dilation=True,
+ use_stochastic=False,
+ drop_path=0.1,
+ relative_pos=False,
+ norm_eval=False,
+ frozen_stages=0),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='VigClsHead',
+ num_classes=1000,
+ in_channels=640,
+ hidden_dim=1024,
+ act_cfg=dict(type='GELU'),
+ dropout=0.,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ topk=(1, 5),
+ ),
+ train_cfg=dict(augments=[
+ dict(type='Mixup', alpha=0.8),
+ dict(type='CutMix', alpha=1.0)
+ ]),
+)
diff --git a/configs/_base_/models/vig/vig_small.py b/configs/_base_/models/vig/vig_small.py
new file mode 100644
index 0000000000000000000000000000000000000000..93587ffba628d8900b17a537eed1406c7af57e9a
--- /dev/null
+++ b/configs/_base_/models/vig/vig_small.py
@@ -0,0 +1,33 @@
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='Vig',
+ arch='small',
+ k=9,
+ act_cfg=dict(type='GELU'),
+ norm_cfg=dict(type='BN'),
+ graph_conv_type='mr',
+ graph_conv_bias=True,
+ epsilon=0.2,
+ use_dilation=True,
+ use_stochastic=False,
+ drop_path=0.1,
+ relative_pos=False,
+ norm_eval=False,
+ frozen_stages=0),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='VigClsHead',
+ num_classes=1000,
+ in_channels=320,
+ hidden_dim=1024,
+ act_cfg=dict(type='GELU'),
+ dropout=0.,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ topk=(1, 5),
+ ),
+ train_cfg=dict(augments=[
+ dict(type='Mixup', alpha=0.8),
+ dict(type='CutMix', alpha=1.0)
+ ]),
+)
diff --git a/configs/_base_/models/vig/vig_tiny.py b/configs/_base_/models/vig/vig_tiny.py
new file mode 100644
index 0000000000000000000000000000000000000000..c50bac222a88a665a1b7adc8398f805ff10be7f1
--- /dev/null
+++ b/configs/_base_/models/vig/vig_tiny.py
@@ -0,0 +1,33 @@
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='Vig',
+ arch='tiny',
+ k=9,
+ act_cfg=dict(type='GELU'),
+ norm_cfg=dict(type='BN'),
+ graph_conv_type='mr',
+ graph_conv_bias=True,
+ epsilon=0.2,
+ use_dilation=True,
+ use_stochastic=False,
+ drop_path=0.1,
+ relative_pos=False,
+ norm_eval=False,
+ frozen_stages=0),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='VigClsHead',
+ num_classes=1000,
+ in_channels=192,
+ hidden_dim=1024,
+ act_cfg=dict(type='GELU'),
+ dropout=0.,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ topk=(1, 5),
+ ),
+ train_cfg=dict(augments=[
+ dict(type='Mixup', alpha=0.8),
+ dict(type='CutMix', alpha=1.0)
+ ]),
+)
diff --git a/configs/_base_/models/vit-base-p16.py b/configs/_base_/models/vit-base-p16.py
new file mode 100644
index 0000000000000000000000000000000000000000..bb42bed5fa5ecedf9aa94c82ee63462181df0605
--- /dev/null
+++ b/configs/_base_/models/vit-base-p16.py
@@ -0,0 +1,25 @@
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='VisionTransformer',
+ arch='b',
+ img_size=224,
+ patch_size=16,
+ drop_rate=0.1,
+ init_cfg=[
+ dict(
+ type='Kaiming',
+ layer='Conv2d',
+ mode='fan_in',
+ nonlinearity='linear')
+ ]),
+ neck=None,
+ head=dict(
+ type='VisionTransformerClsHead',
+ num_classes=1000,
+ in_channels=768,
+ loss=dict(
+ type='LabelSmoothLoss', label_smooth_val=0.1,
+ mode='classy_vision'),
+ ))
diff --git a/configs/_base_/models/vit-base-p32.py b/configs/_base_/models/vit-base-p32.py
new file mode 100644
index 0000000000000000000000000000000000000000..ad550ef9b9bdbb218e6743ccf37e7929e5758865
--- /dev/null
+++ b/configs/_base_/models/vit-base-p32.py
@@ -0,0 +1,24 @@
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='VisionTransformer',
+ arch='b',
+ img_size=224,
+ patch_size=32,
+ drop_rate=0.1,
+ init_cfg=[
+ dict(
+ type='Kaiming',
+ layer='Conv2d',
+ mode='fan_in',
+ nonlinearity='linear')
+ ]),
+ neck=None,
+ head=dict(
+ type='VisionTransformerClsHead',
+ num_classes=1000,
+ in_channels=768,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ topk=(1, 5),
+ ))
diff --git a/configs/_base_/models/vit-large-p16.py b/configs/_base_/models/vit-large-p16.py
new file mode 100644
index 0000000000000000000000000000000000000000..97162304563827716366d20bd29a11fed542be62
--- /dev/null
+++ b/configs/_base_/models/vit-large-p16.py
@@ -0,0 +1,24 @@
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='VisionTransformer',
+ arch='l',
+ img_size=224,
+ patch_size=16,
+ drop_rate=0.1,
+ init_cfg=[
+ dict(
+ type='Kaiming',
+ layer='Conv2d',
+ mode='fan_in',
+ nonlinearity='linear')
+ ]),
+ neck=None,
+ head=dict(
+ type='VisionTransformerClsHead',
+ num_classes=1000,
+ in_channels=1024,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ topk=(1, 5),
+ ))
diff --git a/configs/_base_/models/vit-large-p32.py b/configs/_base_/models/vit-large-p32.py
new file mode 100644
index 0000000000000000000000000000000000000000..f9491bb561433ff01f60a8aa7a4993c28c8b9b02
--- /dev/null
+++ b/configs/_base_/models/vit-large-p32.py
@@ -0,0 +1,24 @@
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='VisionTransformer',
+ arch='l',
+ img_size=224,
+ patch_size=32,
+ drop_rate=0.1,
+ init_cfg=[
+ dict(
+ type='Kaiming',
+ layer='Conv2d',
+ mode='fan_in',
+ nonlinearity='linear')
+ ]),
+ neck=None,
+ head=dict(
+ type='VisionTransformerClsHead',
+ num_classes=1000,
+ in_channels=1024,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ topk=(1, 5),
+ ))
diff --git a/configs/_base_/models/wide-resnet50.py b/configs/_base_/models/wide-resnet50.py
new file mode 100644
index 0000000000000000000000000000000000000000..a2913b9aa6afb10c36199530441ab39348650bc7
--- /dev/null
+++ b/configs/_base_/models/wide-resnet50.py
@@ -0,0 +1,20 @@
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='ResNet',
+ depth=50,
+ num_stages=4,
+ out_indices=(3, ),
+ stem_channels=64,
+ base_channels=128,
+ expansion=2,
+ style='pytorch'),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=2048,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ topk=(1, 5),
+ ))
diff --git a/configs/_base_/schedules/cifar10_bs128.py b/configs/_base_/schedules/cifar10_bs128.py
new file mode 100644
index 0000000000000000000000000000000000000000..fadb6c1285515b0d0ee7c2c17c3a9d19f4a63713
--- /dev/null
+++ b/configs/_base_/schedules/cifar10_bs128.py
@@ -0,0 +1,15 @@
+# optimizer
+optim_wrapper = dict(
+ optimizer=dict(type='SGD', lr=0.1, momentum=0.9, weight_decay=0.0001))
+# learning policy
+param_scheduler = dict(
+ type='MultiStepLR', by_epoch=True, milestones=[100, 150], gamma=0.1)
+
+# train, val, test setting
+train_cfg = dict(by_epoch=True, max_epochs=200, val_interval=1)
+val_cfg = dict()
+test_cfg = dict()
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR
+# based on the actual training batch size.
+auto_scale_lr = dict(base_batch_size=128)
diff --git a/configs/_base_/schedules/cub_bs64.py b/configs/_base_/schedules/cub_bs64.py
new file mode 100644
index 0000000000000000000000000000000000000000..1d0b4be7bd7b7043636fb2356b76512281a37e2b
--- /dev/null
+++ b/configs/_base_/schedules/cub_bs64.py
@@ -0,0 +1,34 @@
+# optimizer
+optim_wrapper = dict(
+ optimizer=dict(
+ type='SGD', lr=0.01, momentum=0.9, weight_decay=0.0005, nesterov=True))
+
+# learning policy
+param_scheduler = [
+ # warm up learning rate scheduler
+ dict(
+ type='LinearLR',
+ start_factor=0.01,
+ by_epoch=True,
+ begin=0,
+ end=5,
+ # update by iter
+ convert_to_iter_based=True),
+ # main learning rate scheduler
+ dict(
+ type='CosineAnnealingLR',
+ T_max=95,
+ by_epoch=True,
+ begin=5,
+ end=100,
+ )
+]
+
+# train, val, test setting
+train_cfg = dict(by_epoch=True, max_epochs=100, val_interval=1)
+val_cfg = dict()
+test_cfg = dict()
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR
+# based on the actual training batch size.
+auto_scale_lr = dict(base_batch_size=64)
diff --git a/configs/_base_/schedules/imagenet_bs1024_adamw_conformer.py b/configs/_base_/schedules/imagenet_bs1024_adamw_conformer.py
new file mode 100644
index 0000000000000000000000000000000000000000..2285d0ea6c70de222a76d6b7404fc16e5fd28e0e
--- /dev/null
+++ b/configs/_base_/schedules/imagenet_bs1024_adamw_conformer.py
@@ -0,0 +1,43 @@
+optim_wrapper = dict(
+ optimizer=dict(
+ type='AdamW',
+ # with a batch size of 128 per GPU on 8 GPUs (1024 in total):
+ # lr = 5e-4 * 128 * 8 / 512 = 0.001
+ lr=5e-4 * 128 * 8 / 512,
+ weight_decay=0.05,
+ eps=1e-8,
+ betas=(0.9, 0.999)),
+ paramwise_cfg=dict(
+ norm_decay_mult=0.0,
+ bias_decay_mult=0.0,
+ custom_keys={
+ '.cls_token': dict(decay_mult=0.0),
+ }),
+)
+
+# learning policy
+param_scheduler = [
+ dict(
+ type='LinearLR',
+ start_factor=1e-3,
+ by_epoch=True,
+ begin=0,
+ end=5,
+ convert_to_iter_based=True),
+ dict(
+ type='CosineAnnealingLR',
+ T_max=295,
+ eta_min=1e-5,
+ by_epoch=True,
+ begin=5,
+ end=300)
+]
+
+# train, val, test setting
+train_cfg = dict(by_epoch=True, max_epochs=300, val_interval=1)
+val_cfg = dict()
+test_cfg = dict()
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR,
+# based on the actual training batch size.
+auto_scale_lr = dict(base_batch_size=1024)
diff --git a/configs/_base_/schedules/imagenet_bs1024_adamw_hivit.py b/configs/_base_/schedules/imagenet_bs1024_adamw_hivit.py
new file mode 100644
index 0000000000000000000000000000000000000000..5b2df97b813d1c3922dd470d2f0743eca44221ee
--- /dev/null
+++ b/configs/_base_/schedules/imagenet_bs1024_adamw_hivit.py
@@ -0,0 +1,41 @@
+# with a batch size of 128 per GPU on 8 GPUs (1024 in total):
+# lr = 5e-4 * 1024 / 512 = 0.001
+optim_wrapper = dict(
+ optimizer=dict(
+ type='AdamW',
+ lr=5e-4 * 1024 / 512,
+ weight_decay=0.05,
+ eps=1e-8,
+ betas=(0.9, 0.999)),
+ paramwise_cfg=dict(
+ norm_decay_mult=0.0,
+ bias_decay_mult=0.0,
+ flat_decay_mult=0.0,
+ custom_keys={
+ '.pos_embed': dict(decay_mult=0.0),
+ '.relative_position_bias_table': dict(decay_mult=0.0)
+ }),
+)
+
+# learning policy
+param_scheduler = [
+ # warm up learning rate scheduler
+ dict(
+ type='LinearLR',
+ start_factor=1e-3,
+ by_epoch=True,
+ end=20,
+ # update by iter
+ convert_to_iter_based=True),
+ # main learning rate scheduler
+ dict(type='CosineAnnealingLR', eta_min=1e-5, by_epoch=True, begin=20)
+]
+
+# train, val, test setting
+train_cfg = dict(by_epoch=True, max_epochs=300, val_interval=1)
+val_cfg = dict()
+test_cfg = dict()
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR,
+# based on the actual training batch size.
+auto_scale_lr = dict(base_batch_size=1024)
diff --git a/configs/_base_/schedules/imagenet_bs1024_adamw_revvit.py b/configs/_base_/schedules/imagenet_bs1024_adamw_revvit.py
new file mode 100644
index 0000000000000000000000000000000000000000..87fd202ce4076a69cae63f0d9d3f6b860639ff49
--- /dev/null
+++ b/configs/_base_/schedules/imagenet_bs1024_adamw_revvit.py
@@ -0,0 +1,41 @@
+# assuming a total batch size of 2048:
+# lr = 5e-4 * 2048 / 512 = 0.002
+# schedule settings
+optim_wrapper = dict(
+ optimizer=dict(
+ type='AdamW',
+ lr=5e-4 * 2048 / 512,
+ weight_decay=0.05,
+ eps=1e-8,
+ betas=(0.9, 0.999)),
+ paramwise_cfg=dict(
+ norm_decay_mult=0.0,
+ bias_decay_mult=0.0,
+ custom_keys={
+ '.cls_token': dict(decay_mult=0.0),
+ '.pos_embed': dict(decay_mult=0.0)
+ }),
+ clip_grad=dict(max_norm=1.0),
+)
+# learning policy
+param_scheduler = [
+ # warm up learning rate scheduler
+ dict(
+ type='LinearLR',
+ start_factor=1e-8 / 2e-3,
+ by_epoch=True,
+ end=70,
+ # update by iter
+ convert_to_iter_based=True),
+ # main learning rate scheduler
+ dict(type='CosineAnnealingLR', eta_min=1e-5, by_epoch=True, begin=70)
+]
+
+# train, val, test setting
+train_cfg = dict(by_epoch=True, max_epochs=300, val_interval=1)
+val_cfg = dict()
+test_cfg = dict()
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR,
+# based on the actual training batch size.
+auto_scale_lr = dict(base_batch_size=1024)
diff --git a/configs/_base_/schedules/imagenet_bs1024_adamw_swin.py b/configs/_base_/schedules/imagenet_bs1024_adamw_swin.py
new file mode 100644
index 0000000000000000000000000000000000000000..fd06cc115a7ab4cbaa7ef7fa1d9366bdd5db878f
--- /dev/null
+++ b/configs/_base_/schedules/imagenet_bs1024_adamw_swin.py
@@ -0,0 +1,41 @@
+# with a batch size of 128 per GPU on 8 GPUs (1024 in total):
+# lr = 5e-4 * 1024 / 512 = 0.001
+optim_wrapper = dict(
+ optimizer=dict(
+ type='AdamW',
+ lr=5e-4 * 1024 / 512,
+ weight_decay=0.05,
+ eps=1e-8,
+ betas=(0.9, 0.999)),
+ paramwise_cfg=dict(
+ norm_decay_mult=0.0,
+ bias_decay_mult=0.0,
+ flat_decay_mult=0.0,
+ custom_keys={
+ '.absolute_pos_embed': dict(decay_mult=0.0),
+ '.relative_position_bias_table': dict(decay_mult=0.0)
+ }),
+)
+
+# learning policy
+param_scheduler = [
+ # warm up learning rate scheduler
+ dict(
+ type='LinearLR',
+ start_factor=1e-3,
+ by_epoch=True,
+ end=20,
+ # update by iter
+ convert_to_iter_based=True),
+ # main learning rate scheduler
+ dict(type='CosineAnnealingLR', eta_min=1e-5, by_epoch=True, begin=20)
+]
+
+# train, val, test setting
+train_cfg = dict(by_epoch=True, max_epochs=300, val_interval=1)
+val_cfg = dict()
+test_cfg = dict()
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR,
+# based on the actual training batch size.
+auto_scale_lr = dict(base_batch_size=1024)
diff --git a/configs/_base_/schedules/imagenet_bs1024_coslr.py b/configs/_base_/schedules/imagenet_bs1024_coslr.py
new file mode 100644
index 0000000000000000000000000000000000000000..285884d0b2b132329bab682f4418d891d7378ec1
--- /dev/null
+++ b/configs/_base_/schedules/imagenet_bs1024_coslr.py
@@ -0,0 +1,18 @@
+# optimizer
+optim_wrapper = dict(
+ optimizer=dict(type='SGD', lr=0.8, momentum=0.9, weight_decay=5e-5))
+
+# learning policy
+param_scheduler = [
+ dict(type='LinearLR', start_factor=0.1, by_epoch=True, begin=0, end=5),
+ dict(type='CosineAnnealingLR', T_max=95, by_epoch=True, begin=5, end=100)
+]
+
+# train, val, test setting
+train_cfg = dict(by_epoch=True, max_epochs=100, val_interval=1)
+val_cfg = dict()
+test_cfg = dict()
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR,
+# based on the actual training batch size.
+auto_scale_lr = dict(base_batch_size=1024)
diff --git a/configs/_base_/schedules/imagenet_bs1024_linearlr_bn_nowd.py b/configs/_base_/schedules/imagenet_bs1024_linearlr_bn_nowd.py
new file mode 100644
index 0000000000000000000000000000000000000000..cf38d4731c867ac381ff0420b0063f8a7e7dfe2e
--- /dev/null
+++ b/configs/_base_/schedules/imagenet_bs1024_linearlr_bn_nowd.py
@@ -0,0 +1,20 @@
+# optimizer
+optim_wrapper = dict(
+ optimizer=dict(type='SGD', lr=0.5, momentum=0.9, weight_decay=0.00004),
+ paramwise_cfg=dict(norm_decay_mult=0),
+)
+
+# learning policy
+param_scheduler = [
+ dict(type='ConstantLR', factor=0.1, by_epoch=False, begin=0, end=5000),
+ dict(type='PolyLR', eta_min=0, by_epoch=False, begin=5000)
+]
+
+# train, val, test setting
+train_cfg = dict(by_epoch=True, max_epochs=300, val_interval=1)
+val_cfg = dict()
+test_cfg = dict()
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR,
+# based on the actual training batch size.
+auto_scale_lr = dict(base_batch_size=1024)
diff --git a/configs/_base_/schedules/imagenet_bs2048.py b/configs/_base_/schedules/imagenet_bs2048.py
new file mode 100644
index 0000000000000000000000000000000000000000..1cfbfbe6752d923c248b92f3c7b7ace817bad9a4
--- /dev/null
+++ b/configs/_base_/schedules/imagenet_bs2048.py
@@ -0,0 +1,21 @@
+# optimizer
+optim_wrapper = dict(
+ optimizer=dict(
+ type='SGD', lr=0.8, momentum=0.9, weight_decay=0.0001, nesterov=True))
+
+# learning policy
+param_scheduler = [
+ dict(
+ type='LinearLR', start_factor=0.25, by_epoch=False, begin=0, end=2500),
+ dict(
+ type='MultiStepLR', by_epoch=True, milestones=[30, 60, 90], gamma=0.1)
+]
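+# NOTE: the 2500-iteration warmup corresponds to roughly 4 epochs on
+# ImageNet-1k at a total batch size of 2048 (about 625 iterations per epoch).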
+
+# train, val, test setting
+train_cfg = dict(by_epoch=True, max_epochs=100, val_interval=1)
+val_cfg = dict()
+test_cfg = dict()
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR,
+# based on the actual training batch size.
+auto_scale_lr = dict(base_batch_size=2048)
diff --git a/configs/_base_/schedules/imagenet_bs2048_AdamW.py b/configs/_base_/schedules/imagenet_bs2048_AdamW.py
new file mode 100644
index 0000000000000000000000000000000000000000..bbfae8ef222b10663e1313000d05290d729ca5c8
--- /dev/null
+++ b/configs/_base_/schedules/imagenet_bs2048_AdamW.py
@@ -0,0 +1,39 @@
+# optimizer
+# In ClassyVision, the lr is set to 0.003 for bs4096.
+# In this implementation (bs2048), lr = 0.003 / 4096 * (32 images/GPU * 64 GPUs) = 0.0015
+optim_wrapper = dict(
+ optimizer=dict(type='AdamW', lr=0.0015, weight_decay=0.3),
+ # specific to vit pretrain
+ paramwise_cfg=dict(custom_keys={
+ '.cls_token': dict(decay_mult=0.0),
+ '.pos_embed': dict(decay_mult=0.0)
+ }),
+)
+
+# learning policy
+warmup_epochs = 15 # about 10000 iterations for ImageNet-1k
+param_scheduler = [
+ # warm up learning rate scheduler
+ dict(
+ type='LinearLR',
+ start_factor=1e-3,
+ by_epoch=True,
+ end=warmup_epochs,
+ # update by iter
+ convert_to_iter_based=True),
+ # main learning rate scheduler
+ dict(
+ type='CosineAnnealingLR',
+ eta_min=1e-5,
+ by_epoch=True,
+ begin=warmup_epochs)
+]
+
+# train, val, test setting
+train_cfg = dict(by_epoch=True, max_epochs=300, val_interval=1)
+val_cfg = dict()
+test_cfg = dict()
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR,
+# based on the actual training batch size.
+auto_scale_lr = dict(base_batch_size=2048)
diff --git a/configs/_base_/schedules/imagenet_bs2048_adamw_levit.py b/configs/_base_/schedules/imagenet_bs2048_adamw_levit.py
new file mode 100644
index 0000000000000000000000000000000000000000..25a536eaac52f1c42b37e0d0b102da252deebd67
--- /dev/null
+++ b/configs/_base_/schedules/imagenet_bs2048_adamw_levit.py
@@ -0,0 +1,40 @@
+# with a batch size of 256 per GPU on 8 GPUs:
+# lr = 5e-4 * 256 * 8 / 512 = 0.002
+optim_wrapper = dict(
+ optimizer=dict(
+ type='AdamW',
+ lr=0.002,
+ weight_decay=0.025,
+ eps=1e-8,
+ betas=(0.9, 0.999)),
+ paramwise_cfg=dict(
+ norm_decay_mult=0.0,
+ bias_decay_mult=0.0,
+ custom_keys={
+ '.attention_biases': dict(decay_mult=0.0),
+ }),
+)
+
+# learning policy
+param_scheduler = [
+ # warm up learning rate scheduler
+ dict(
+ type='LinearLR',
+ start_factor=1e-6 / 0.002,
+ by_epoch=True,
+ end=5,
+ # update by iter
+ convert_to_iter_based=True,
+ ),
+ # main learning rate scheduler
+ dict(type='CosineAnnealingLR', eta_min=1e-5, by_epoch=True, begin=5)
+]
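+# NOTE: the warmup starts from an lr of 1e-6 (start_factor * base lr of 0.002)
+# and ramps up linearly over the first 5 epochs.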
+
+# train, val, test setting
+train_cfg = dict(by_epoch=True, max_epochs=1000)
+val_cfg = dict()
+test_cfg = dict()
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR,
+# based on the actual training batch size.
+auto_scale_lr = dict(base_batch_size=2048)
diff --git a/configs/_base_/schedules/imagenet_bs2048_coslr.py b/configs/_base_/schedules/imagenet_bs2048_coslr.py
new file mode 100644
index 0000000000000000000000000000000000000000..b8551f55c8082ba07c084324c2bf1fbb9f26ea56
--- /dev/null
+++ b/configs/_base_/schedules/imagenet_bs2048_coslr.py
@@ -0,0 +1,35 @@
+# optimizer
+optim_wrapper = dict(
+ optimizer=dict(
+ type='SGD', lr=0.8, momentum=0.9, weight_decay=0.0001, nesterov=True))
+
+# learning policy
+param_scheduler = [
+ # warm up learning rate scheduler
+ dict(
+ type='LinearLR',
+ start_factor=0.25,
+ by_epoch=True,
+ begin=0,
+ # about 2500 iterations for ImageNet-1k
+ end=5,
+ # update by iter
+ convert_to_iter_based=True),
+ # main learning rate scheduler
+ dict(
+ type='CosineAnnealingLR',
+ T_max=95,
+ by_epoch=True,
+ begin=5,
+ end=100,
+ )
+]
+
+# train, val, test setting
+train_cfg = dict(by_epoch=True, max_epochs=100, val_interval=1)
+val_cfg = dict()
+test_cfg = dict()
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR,
+# based on the actual training batch size.
+auto_scale_lr = dict(base_batch_size=2048)
diff --git a/configs/_base_/schedules/imagenet_bs2048_rsb.py b/configs/_base_/schedules/imagenet_bs2048_rsb.py
new file mode 100644
index 0000000000000000000000000000000000000000..f0d2d7994293afdc43b906c918d486397dc53206
--- /dev/null
+++ b/configs/_base_/schedules/imagenet_bs2048_rsb.py
@@ -0,0 +1,32 @@
+# optimizer
+optim_wrapper = dict(optimizer=dict(type='Lamb', lr=0.005, weight_decay=0.02))
+
+# learning policy
+param_scheduler = [
+ # warm up learning rate scheduler
+ dict(
+ type='LinearLR',
+ start_factor=0.0001,
+ by_epoch=True,
+ begin=0,
+ end=5,
+ # update by iter
+ convert_to_iter_based=True),
+ # main learning rate scheduler
+ dict(
+ type='CosineAnnealingLR',
+ T_max=95,
+ eta_min=1.0e-6,
+ by_epoch=True,
+ begin=5,
+ end=100)
+]
+
+# train, val, test setting
+train_cfg = dict(by_epoch=True, max_epochs=100, val_interval=1)
+val_cfg = dict()
+test_cfg = dict()
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR,
+# based on the actual training batch size.
+auto_scale_lr = dict(base_batch_size=2048)
diff --git a/configs/_base_/schedules/imagenet_bs256.py b/configs/_base_/schedules/imagenet_bs256.py
new file mode 100644
index 0000000000000000000000000000000000000000..3f92273d1b831ae5cd6663cfe65b1f0d8f01e630
--- /dev/null
+++ b/configs/_base_/schedules/imagenet_bs256.py
@@ -0,0 +1,16 @@
+# optimizer
+optim_wrapper = dict(
+ optimizer=dict(type='SGD', lr=0.1, momentum=0.9, weight_decay=0.0001))
+
+# learning policy
+param_scheduler = dict(
+ type='MultiStepLR', by_epoch=True, milestones=[30, 60, 90], gamma=0.1)
+
+# train, val, test setting
+train_cfg = dict(by_epoch=True, max_epochs=100, val_interval=1)
+val_cfg = dict()
+test_cfg = dict()
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR,
+# based on the actual training batch size.
+auto_scale_lr = dict(base_batch_size=256)
diff --git a/configs/_base_/schedules/imagenet_bs256_140e.py b/configs/_base_/schedules/imagenet_bs256_140e.py
new file mode 100644
index 0000000000000000000000000000000000000000..e65bf522d9739073baf38db7f10e6b27d7cd4f31
--- /dev/null
+++ b/configs/_base_/schedules/imagenet_bs256_140e.py
@@ -0,0 +1,16 @@
+# optimizer
+optim_wrapper = dict(
+ optimizer=dict(type='SGD', lr=0.1, momentum=0.9, weight_decay=0.0001))
+
+# learning policy
+param_scheduler = dict(
+ type='MultiStepLR', by_epoch=True, milestones=[40, 80, 120], gamma=0.1)
+
+# train, val, test setting
+train_cfg = dict(by_epoch=True, max_epochs=140, val_interval=1)
+val_cfg = dict()
+test_cfg = dict()
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR,
+# based on the actual training batch size.
+auto_scale_lr = dict(base_batch_size=256)
diff --git a/configs/_base_/schedules/imagenet_bs256_200e_coslr_warmup.py b/configs/_base_/schedules/imagenet_bs256_200e_coslr_warmup.py
new file mode 100644
index 0000000000000000000000000000000000000000..c8d94a7606aead6d4142bf8a61228eb6475eb5c6
--- /dev/null
+++ b/configs/_base_/schedules/imagenet_bs256_200e_coslr_warmup.py
@@ -0,0 +1,34 @@
+# optimizer
+optim_wrapper = dict(
+ optimizer=dict(type='SGD', lr=0.1, momentum=0.9, weight_decay=0.0001))
+
+# learning policy
+param_scheduler = [
+ # warm up learning rate scheduler
+ dict(
+ type='LinearLR',
+ start_factor=0.25,
+ by_epoch=True,
+ begin=0,
+ end=5,
+ # update by iter
+ convert_to_iter_based=True,
+ ),
+ # main learning rate scheduler
+ dict(
+ type='CosineAnnealingLR',
+ T_max=195,
+ by_epoch=True,
+ begin=5,
+ end=200,
+ )
+]
+
+# train, val, test setting
+train_cfg = dict(by_epoch=True, max_epochs=200, val_interval=1)
+val_cfg = dict()
+test_cfg = dict()
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR,
+# based on the actual training batch size.
+auto_scale_lr = dict(base_batch_size=256)
diff --git a/configs/_base_/schedules/imagenet_bs256_coslr.py b/configs/_base_/schedules/imagenet_bs256_coslr.py
new file mode 100644
index 0000000000000000000000000000000000000000..44e2c8bb5d0800568bb3c7079b9e0c3e1322711c
--- /dev/null
+++ b/configs/_base_/schedules/imagenet_bs256_coslr.py
@@ -0,0 +1,16 @@
+# optimizer
+optim_wrapper = dict(
+ optimizer=dict(type='SGD', lr=0.1, momentum=0.9, weight_decay=0.0001))
+
+# learning policy
+param_scheduler = dict(
+ type='CosineAnnealingLR', T_max=100, by_epoch=True, begin=0, end=100)
+
+# train, val, test setting
+train_cfg = dict(by_epoch=True, max_epochs=100, val_interval=1)
+val_cfg = dict()
+test_cfg = dict()
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR,
+# based on the actual training batch size.
+auto_scale_lr = dict(base_batch_size=256)
diff --git a/configs/_base_/schedules/imagenet_bs256_coslr_coswd_300e.py b/configs/_base_/schedules/imagenet_bs256_coslr_coswd_300e.py
new file mode 100644
index 0000000000000000000000000000000000000000..318e031574367aa9d34ec28453deccc60377372f
--- /dev/null
+++ b/configs/_base_/schedules/imagenet_bs256_coslr_coswd_300e.py
@@ -0,0 +1,40 @@
+# optimizer
+optim_wrapper = dict(
+ optimizer=dict(type='SGD', lr=0.1, momentum=0.9, weight_decay=0.0001))
+
+# learning policy
+param_scheduler = [
+ # warm up learning rate scheduler
+ dict(
+ type='LinearLR',
+ start_factor=0.001,
+ by_epoch=True,
+ begin=0,
+ end=5,
+ # update by iter
+ convert_to_iter_based=True),
+ # main learning rate scheduler
+ dict(
+ type='CosineAnnealingLR',
+ T_max=295,
+ eta_min=1.0e-6,
+ by_epoch=True,
+ begin=5,
+ end=300),
+ dict(
+ type='CosineAnnealingParamScheduler',
+ param_name='weight_decay',
+ eta_min=0.00001,
+ by_epoch=True,
+ begin=0,
+ end=300)
+]
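+# NOTE: in addition to the lr, the weight decay is cosine-annealed from the
+# initial 1e-4 down to 1e-5 over the full 300 epochs.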
+
+# train, val, test setting
+train_cfg = dict(by_epoch=True, max_epochs=300, val_interval=1)
+val_cfg = dict()
+test_cfg = dict()
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR,
+# based on the actual training batch size.
+auto_scale_lr = dict(base_batch_size=256)
diff --git a/configs/_base_/schedules/imagenet_bs256_epochstep.py b/configs/_base_/schedules/imagenet_bs256_epochstep.py
new file mode 100644
index 0000000000000000000000000000000000000000..b8c2b905bf362022d07d452df76c10cccfb6565e
--- /dev/null
+++ b/configs/_base_/schedules/imagenet_bs256_epochstep.py
@@ -0,0 +1,15 @@
+# optimizer
+optim_wrapper = dict(
+ optimizer=dict(type='SGD', lr=0.045, momentum=0.9, weight_decay=0.00004))
+
+# learning policy
+param_scheduler = dict(type='StepLR', by_epoch=True, step_size=1, gamma=0.98)
+
+# train, val, test setting
+train_cfg = dict(by_epoch=True, max_epochs=300, val_interval=1)
+val_cfg = dict()
+test_cfg = dict()
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR,
+# based on the actual training batch size.
+auto_scale_lr = dict(base_batch_size=256)
diff --git a/configs/_base_/schedules/imagenet_bs4096_AdamW.py b/configs/_base_/schedules/imagenet_bs4096_AdamW.py
new file mode 100644
index 0000000000000000000000000000000000000000..84b1f39beaef86b412c159a54d74c4f09458dc57
--- /dev/null
+++ b/configs/_base_/schedules/imagenet_bs4096_AdamW.py
@@ -0,0 +1,39 @@
+# optimizer
+optim_wrapper = dict(
+ optimizer=dict(type='AdamW', lr=0.003, weight_decay=0.3),
+ # specific to vit pretrain
+ paramwise_cfg=dict(custom_keys={
+ '.cls_token': dict(decay_mult=0.0),
+ '.pos_embed': dict(decay_mult=0.0)
+ }),
+)
+
+# learning policy
+param_scheduler = [
+ # warm up learning rate scheduler
+ dict(
+ type='LinearLR',
+ start_factor=1e-4,
+ by_epoch=True,
+ begin=0,
+ end=30,
+ # update by iter
+ convert_to_iter_based=True),
+ # main learning rate scheduler
+ dict(
+ type='CosineAnnealingLR',
+ T_max=270,
+ by_epoch=True,
+ begin=30,
+ end=300,
+ )
+]
+
+# train, val, test setting
+train_cfg = dict(by_epoch=True, max_epochs=300, val_interval=1)
+val_cfg = dict()
+test_cfg = dict()
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR,
+# based on the actual training batch size.
+auto_scale_lr = dict(base_batch_size=4096)
diff --git a/configs/_base_/schedules/imagenet_lars_coslr_200e.py b/configs/_base_/schedules/imagenet_lars_coslr_200e.py
new file mode 100644
index 0000000000000000000000000000000000000000..baba55c4f43b60620a646c812b24e6ffcbd7860a
--- /dev/null
+++ b/configs/_base_/schedules/imagenet_lars_coslr_200e.py
@@ -0,0 +1,20 @@
+# optimizer wrapper
+optim_wrapper = dict(
+ type='OptimWrapper',
+ optimizer=dict(type='LARS', lr=4.8, weight_decay=1e-6, momentum=0.9))
+
+# learning rate scheduler
+param_scheduler = [
+ dict(
+ type='LinearLR',
+ start_factor=1e-4,
+ by_epoch=True,
+ begin=0,
+ end=10,
+ convert_to_iter_based=True),
+ dict(
+ type='CosineAnnealingLR', T_max=190, by_epoch=True, begin=10, end=200)
+]
+
+# runtime settings
+train_cfg = dict(type='EpochBasedTrainLoop', max_epochs=200)
diff --git a/configs/_base_/schedules/imagenet_lars_coslr_90e.py b/configs/_base_/schedules/imagenet_lars_coslr_90e.py
new file mode 100644
index 0000000000000000000000000000000000000000..6e7875a36e76eccefbf752d704fcb12beb6c6506
--- /dev/null
+++ b/configs/_base_/schedules/imagenet_lars_coslr_90e.py
@@ -0,0 +1,14 @@
+# optimizer wrapper
+optim_wrapper = dict(
+ type='OptimWrapper',
+ optimizer=dict(type='LARS', lr=1.6, momentum=0.9, weight_decay=0.))
+
+# learning rate scheduler
+param_scheduler = [
+ dict(type='CosineAnnealingLR', T_max=90, by_epoch=True, begin=0, end=90)
+]
+
+# runtime settings
+train_cfg = dict(type='EpochBasedTrainLoop', max_epochs=90)
+val_cfg = dict()
+test_cfg = dict()
diff --git a/configs/_base_/schedules/imagenet_sgd_coslr_100e.py b/configs/_base_/schedules/imagenet_sgd_coslr_100e.py
new file mode 100644
index 0000000000000000000000000000000000000000..08e9a3e71fc0d8c186b8fdeb5bb59fd3a1d5148e
--- /dev/null
+++ b/configs/_base_/schedules/imagenet_sgd_coslr_100e.py
@@ -0,0 +1,14 @@
+# optimizer wrapper
+optim_wrapper = dict(
+ type='OptimWrapper',
+ optimizer=dict(type='SGD', lr=0.3, momentum=0.9, weight_decay=1e-6))
+
+# learning rate scheduler
+param_scheduler = [
+ dict(type='CosineAnnealingLR', T_max=100, by_epoch=True, begin=0, end=100)
+]
+
+# runtime settings
+train_cfg = dict(type='EpochBasedTrainLoop', max_epochs=100)
+val_cfg = dict()
+test_cfg = dict()
diff --git a/configs/_base_/schedules/imagenet_sgd_coslr_200e.py b/configs/_base_/schedules/imagenet_sgd_coslr_200e.py
new file mode 100644
index 0000000000000000000000000000000000000000..f38e4983038031c9178813297dc744195e855680
--- /dev/null
+++ b/configs/_base_/schedules/imagenet_sgd_coslr_200e.py
@@ -0,0 +1,12 @@
+# optimizer wrapper
+optim_wrapper = dict(
+ type='OptimWrapper',
+ optimizer=dict(type='SGD', lr=0.03, weight_decay=1e-4, momentum=0.9))
+
+# learning rate scheduler
+param_scheduler = [
+ dict(type='CosineAnnealingLR', T_max=200, by_epoch=True, begin=0, end=200)
+]
+
+# runtime settings
+train_cfg = dict(type='EpochBasedTrainLoop', max_epochs=200)
diff --git a/configs/_base_/schedules/imagenet_sgd_steplr_100e.py b/configs/_base_/schedules/imagenet_sgd_steplr_100e.py
new file mode 100644
index 0000000000000000000000000000000000000000..75b725c7dfb074c3ebe5c7536752eb32c45b89cc
--- /dev/null
+++ b/configs/_base_/schedules/imagenet_sgd_steplr_100e.py
@@ -0,0 +1,14 @@
+# optimizer wrapper
+optim_wrapper = dict(
+ type='OptimWrapper',
+ optimizer=dict(type='SGD', lr=0.1, momentum=0.9, weight_decay=1e-4))
+
+# learning rate scheduler
+param_scheduler = [
+ dict(type='MultiStepLR', by_epoch=True, milestones=[60, 80], gamma=0.1)
+]
+
+# runtime settings
+train_cfg = dict(type='EpochBasedTrainLoop', max_epochs=100)
+val_cfg = dict()
+test_cfg = dict()
diff --git a/configs/arcface/README.md b/configs/arcface/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..6b2ee6a3e6164531da343e954c9f5a20917f052d
--- /dev/null
+++ b/configs/arcface/README.md
@@ -0,0 +1,80 @@
+# ArcFace
+
+> [ArcFace: Additive Angular Margin Loss for Deep Face Recognition](https://arxiv.org/abs/1801.07698)
+
+
+
+## Abstract
+
+Recently, a popular line of research in face recognition is adopting margins in the well-established softmax loss function to maximize class separability. In this paper, we first introduce an Additive Angular Margin Loss (ArcFace), which not only has a clear geometric interpretation but also significantly enhances the discriminative power. Since ArcFace is susceptible to the massive label noise, we further propose sub-center ArcFace, in which each class contains K sub-centers and training samples only need to be close to any of the K positive sub-centers. Sub-center ArcFace encourages one dominant sub-class that contains the majority of clean faces and non-dominant sub-classes that include hard or noisy faces. Based on this self-propelled isolation, we boost the performance through automatically purifying raw web faces under massive real-world noise. Besides discriminative feature embedding, we also explore the inverse problem, mapping feature vectors to face images. Without training any additional generator or discriminator, the pre-trained ArcFace model can generate identity-preserved face images for both subjects inside and outside the training data only by using the network gradient and Batch Normalization (BN) priors. Extensive experiments demonstrate that ArcFace can enhance the discriminative feature embedding as well as strengthen the generative face synthesis.
+
+
+

+
+
+## How to use it?
+
+
+
+**Retrieve image**
+
+```python
+from mmpretrain import ImageRetrievalInferencer
+
+inferencer = ImageRetrievalInferencer('resnet50-arcface_inshop', prototype='demo/')
+predict = inferencer('demo/dog.jpg', topk=2)[0]
+print(predict[0])
+print(predict[1])
+```
+
+**Use the model**
+
+```python
+import torch
+from mmpretrain import get_model
+
+model = get_model('resnet50-arcface_inshop', pretrained=True)
+inputs = torch.rand(1, 3, 224, 224)
+out = model(inputs)
+print(type(out))
+# To extract features.
+feats = model.extract_feat(inputs)
+print(type(feats))
+```
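+
+As a minimal sketch (not an official example, and assuming `extract_feat` returns the pooled backbone features, possibly wrapped in a tuple, as in the snippet above), the extracted embeddings can be compared with cosine similarity:
+
+```python
+import torch
+import torch.nn.functional as F
+from mmpretrain import get_model
+
+model = get_model('resnet50-arcface_inshop', pretrained=True)
+img_a = torch.rand(1, 3, 224, 224)
+img_b = torch.rand(1, 3, 224, 224)
+feat_a = model.extract_feat(img_a)
+feat_b = model.extract_feat(img_b)
+# extract_feat may return a tuple of per-stage features; keep the last one.
+if isinstance(feat_a, (tuple, list)):
+    feat_a, feat_b = feat_a[-1], feat_b[-1]
+print(F.cosine_similarity(feat_a, feat_b).item())
+```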
+
+**Train/Test Command**
+
+Prepare your dataset according to the [docs](https://mmpretrain.readthedocs.io/en/latest/user_guides/dataset_prepare.html#prepare-dataset).
+
+Train:
+
+```shell
+python tools/train.py configs/arcface/resnet50-arcface_8xb32_inshop.py
+```
+
+Test:
+
+```shell
+python tools/test.py configs/arcface/resnet50-arcface_8xb32_inshop.py https://download.openmmlab.com/mmclassification/v0/arcface/resnet50-arcface_inshop_20230202-b766fe7f.pth
+```
+
+
+
+## Models and results
+
+### Image Retrieval on InShop
+
+| Model | Pretrain | Params(M) | Flops(G) | Recall@1 | mAP@10 | Config | Download |
+| :-----------------------: | :------------------------------------------------: | :-------: | :------: | :------: | :----: | :------------------------------------------: | :------------------------------------------------: |
+| `resnet50-arcface_inshop` | [ImageNet-21k-mill](https://download.openmmlab.com/mmclassification/v0/resnet/resnet50_3rdparty-mill_in21k_20220331-faac000b.pth) | 31.69 | 16.48 | 90.18 | 69.30 | [config](./resnet50-arcface_8xb32_inshop.py) | [model](https://download.openmmlab.com/mmclassification/v0/arcface/resnet50-arcface_inshop_20230202-b766fe7f.pth) \| [log](https://download.openmmlab.com/mmclassification/v0/arcface/resnet50-arcface_inshop_20230202-b766fe7f.log) |
+
+## Citation
+
+```bibtex
+@inproceedings{deng2018arcface,
+ title={ArcFace: Additive Angular Margin Loss for Deep Face Recognition},
+ author={Deng, Jiankang and Guo, Jia and Niannan, Xue and Zafeiriou, Stefanos},
+ booktitle={CVPR},
+ year={2019}
+}
+```
diff --git a/configs/arcface/metafile.yml b/configs/arcface/metafile.yml
new file mode 100644
index 0000000000000000000000000000000000000000..050aba5b3e1c2980234aef13767106ed237eee12
--- /dev/null
+++ b/configs/arcface/metafile.yml
@@ -0,0 +1,28 @@
+Collections:
+ - Name: ArcFace
+ Metadata:
+ Training Data: InShop
+ Architecture:
+ - Additive Angular Margin Loss
+ Paper:
+ URL: https://arxiv.org/abs/1801.07698
+ Title: 'ArcFace: Additive Angular Margin Loss for Deep Face Recognition'
+ README: configs/arcface/README.md
+ Code:
+ Version: v1.0.0rc3
+ URL: https://github.com/open-mmlab/mmpretrain/blob/v1.0.0rc3/mmcls/models/heads/margin_head.py
+
+Models:
+ - Name: resnet50-arcface_inshop
+ Metadata:
+ FLOPs: 16571226112
+ Parameters: 31693888
+ In Collection: ArcFace
+ Results:
+ - Dataset: InShop
+ Metrics:
+ Recall@1: 90.18
+ mAP@10: 69.30
+ Task: Image Retrieval
+ Weights: https://download.openmmlab.com/mmclassification/v0/arcface/resnet50-arcface_inshop_20230202-b766fe7f.pth
+ Config: configs/arcface/resnet50-arcface_8xb32_inshop.py
diff --git a/configs/arcface/resnet50-arcface_8xb32_inshop.py b/configs/arcface/resnet50-arcface_8xb32_inshop.py
new file mode 100644
index 0000000000000000000000000000000000000000..cc351e7870415a687679a1970bba0c24ebc02884
--- /dev/null
+++ b/configs/arcface/resnet50-arcface_8xb32_inshop.py
@@ -0,0 +1,71 @@
+_base_ = [
+ '../_base_/datasets/inshop_bs32_448.py',
+ '../_base_/schedules/cub_bs64.py',
+ '../_base_/default_runtime.py',
+]
+
+pretrained = 'https://download.openmmlab.com/mmclassification/v0/resnet/resnet50_3rdparty-mill_in21k_20220331-faac000b.pth' # noqa
+model = dict(
+ type='ImageToImageRetriever',
+ image_encoder=[
+ dict(
+ type='ResNet',
+ depth=50,
+ init_cfg=dict(
+ type='Pretrained', checkpoint=pretrained, prefix='backbone')),
+ dict(type='GlobalAveragePooling'),
+ ],
+ head=dict(
+ type='ArcFaceClsHead',
+ num_classes=3997,
+ in_channels=2048,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ init_cfg=None),
+ prototype={{_base_.gallery_dataloader}})
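+# The `{{_base_.gallery_dataloader}}` reference above pulls the gallery
+# dataloader defined in the base InShop dataset config; the retriever builds
+# its prototype (gallery features) from it before evaluation, see the
+# `PrepareProtoBeforeValLoopHook` registered at the bottom of this config.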
+
+# runtime settings
+default_hooks = dict(
+ # log every 20 iterations
+ logger=dict(type='LoggerHook', interval=20),
+ # save last three checkpoints
+ checkpoint=dict(
+ type='CheckpointHook',
+ save_best='auto',
+ interval=1,
+ max_keep_ckpts=3,
+ rule='greater'))
+
+# optimizer
+optim_wrapper = dict(
+ optimizer=dict(
+ type='SGD', lr=0.02, momentum=0.9, weight_decay=0.0005, nesterov=True))
+
+# learning policy
+param_scheduler = [
+ # warm up learning rate scheduler
+ dict(
+ type='LinearLR',
+ start_factor=0.01,
+ by_epoch=True,
+ begin=0,
+ end=5,
+ # update by iter
+ convert_to_iter_based=True),
+ # main learning rate scheduler
+ dict(
+ type='CosineAnnealingLR',
+ T_max=45,
+ by_epoch=True,
+ begin=5,
+ end=50,
+ )
+]
+
+train_cfg = dict(by_epoch=True, max_epochs=50, val_interval=1)
+
+auto_scale_lr = dict(enable=True, base_batch_size=256)
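+# NOTE: `enable=True` above turns the automatic scaling on, so the lr of 0.02
+# (set for 8 GPUs x 32 images = 256) is rescaled linearly whenever the actual
+# total batch size differs from `base_batch_size`.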
+
+custom_hooks = [
+ dict(type='PrepareProtoBeforeValLoopHook'),
+ dict(type='SyncBuffersHook')
+]
diff --git a/configs/barlowtwins/README.md b/configs/barlowtwins/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..515d138856b170378ecfeb213aff6c582442f335
--- /dev/null
+++ b/configs/barlowtwins/README.md
@@ -0,0 +1,85 @@
+# BarlowTwins
+
+> [Barlow Twins: Self-Supervised Learning via Redundancy Reduction](https://arxiv.org/abs/2103.03230)
+
+
+
+## Abstract
+
+Self-supervised learning (SSL) is rapidly closing the gap with supervised methods on large computer vision benchmarks. A successful approach to SSL is to learn embeddings which are invariant to distortions of the input sample. However, a recurring issue with this approach is the existence of trivial constant solutions. Most current methods avoid such solutions by careful implementation details. We propose an objective function that naturally avoids collapse by measuring the cross-correlation matrix between the outputs of two identical networks fed with distorted versions of a sample, and making it as close to the identity matrix as possible. This causes the embedding vectors of distorted versions of a sample to be similar, while minimizing the redundancy between the components of these vectors. The method is called Barlow Twins, owing to neuroscientist H. Barlow's redundancy-reduction principle applied to a pair of identical networks. Barlow Twins does not require large batches nor asymmetry between the network twins such as a predictor network, gradient stopping, or a moving average on the weight updates. Intriguingly it benefits from very high-dimensional output vectors. Barlow Twins outperforms previous methods on ImageNet for semi-supervised classification in the low-data regime, and is on par with current state of the art for ImageNet classification with a linear classifier head, and for transfer tasks of classification and object detection.
+
+
+

+
+
+## How to use it?
+
+
+
+**Predict image**
+
+```python
+from mmpretrain import inference_model
+
+predict = inference_model('resnet50_barlowtwins-pre_8xb32-linear-coslr-100e_in1k', 'demo/bird.JPEG')
+print(predict['pred_class'])
+print(predict['pred_score'])
+```
+
+**Use the model**
+
+```python
+import torch
+from mmpretrain import get_model
+
+model = get_model('barlowtwins_resnet50_8xb256-coslr-300e_in1k', pretrained=True)
+inputs = torch.rand(1, 3, 224, 224)
+out = model(inputs)
+print(type(out))
+# To extract features.
+feats = model.extract_feat(inputs)
+print(type(feats))
+```
+
+**Train/Test Command**
+
+Prepare your dataset according to the [docs](https://mmpretrain.readthedocs.io/en/latest/user_guides/dataset_prepare.html#prepare-dataset).
+
+Train:
+
+```shell
+python tools/train.py configs/barlowtwins/barlowtwins_resnet50_8xb256-coslr-300e_in1k.py
+```
+
+Test:
+
+```shell
+python tools/test.py configs/barlowtwins/benchmarks/resnet50_8xb32-linear-coslr-100e_in1k.py https://download.openmmlab.com/mmselfsup/1.x/barlowtwins/barlowtwins_resnet50_8xb256-coslr-300e_in1k/resnet50_linear-8xb32-coslr-100e_in1k/resnet50_linear-8xb32-coslr-100e_in1k_20220825-52fde35f.pth
+```
+
+
+
+## Models and results
+
+### Pretrained models
+
+| Model | Params (M) | Flops (G) | Config | Download |
+| :-------------------------------------------- | :--------: | :-------: | :------------------------------------------------------: | :------------------------------------------------------------------------------: |
+| `barlowtwins_resnet50_8xb256-coslr-300e_in1k` | 174.54 | 4.11 | [config](barlowtwins_resnet50_8xb256-coslr-300e_in1k.py) | [model](https://download.openmmlab.com/mmselfsup/1.x/barlowtwins/barlowtwins_resnet50_8xb256-coslr-300e_in1k/barlowtwins_resnet50_8xb256-coslr-300e_in1k_20220825-57307488.pth) \| [log](https://download.openmmlab.com/mmselfsup/1.x/barlowtwins/barlowtwins_resnet50_8xb256-coslr-300e_in1k/barlowtwins_resnet50_8xb256-coslr-300e_in1k_20220825-57307488.json) |
+
+### Image Classification on ImageNet-1k
+
+| Model | Pretrain | Params (M) | Flops (G) | Top-1 (%) | Config | Download |
+| :---------------------------------------- | :------------------------------------------: | :--------: | :-------: | :-------: | :----------------------------------------: | :-------------------------------------------: |
+| `resnet50_barlowtwins-pre_8xb32-linear-coslr-100e_in1k` | [BARLOWTWINS](https://download.openmmlab.com/mmselfsup/1.x/barlowtwins/barlowtwins_resnet50_8xb256-coslr-300e_in1k/barlowtwins_resnet50_8xb256-coslr-300e_in1k_20220825-57307488.pth) | 25.56 | 4.11 | 71.80 | [config](benchmarks/resnet50_8xb32-linear-coslr-100e_in1k.py) | [model](https://download.openmmlab.com/mmselfsup/1.x/barlowtwins/barlowtwins_resnet50_8xb256-coslr-300e_in1k/resnet50_linear-8xb32-coslr-100e_in1k/resnet50_linear-8xb32-coslr-100e_in1k_20220825-52fde35f.pth) \| [log](https://download.openmmlab.com/mmselfsup/1.x/barlowtwins/barlowtwins_resnet50_8xb256-coslr-300e_in1k/resnet50_linear-8xb32-coslr-100e_in1k/resnet50_linear-8xb32-coslr-100e_in1k_20220825-52fde35f.json) |
+
+## Citation
+
+```bibtex
+@inproceedings{zbontar2021barlow,
+ title={Barlow twins: Self-supervised learning via redundancy reduction},
+ author={Zbontar, Jure and Jing, Li and Misra, Ishan and LeCun, Yann and Deny, St{\'e}phane},
+ booktitle={International Conference on Machine Learning},
+ year={2021},
+}
+```
diff --git a/configs/barlowtwins/barlowtwins_resnet50_8xb256-coslr-1000e_in1k.py b/configs/barlowtwins/barlowtwins_resnet50_8xb256-coslr-1000e_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..f12dd2e1460094e98cbc14f8bb81f67a95cb161d
--- /dev/null
+++ b/configs/barlowtwins/barlowtwins_resnet50_8xb256-coslr-1000e_in1k.py
@@ -0,0 +1,70 @@
+_base_ = [
+ '../_base_/datasets/imagenet_bs32_byol.py',
+ '../_base_/default_runtime.py',
+]
+# datasets
+train_dataloader = dict(batch_size=256)
+
+# model settings
+model = dict(
+ type='BarlowTwins',
+ backbone=dict(
+ type='ResNet',
+ depth=50,
+ norm_cfg=dict(type='SyncBN'),
+ zero_init_residual=True),
+ neck=dict(
+ type='NonLinearNeck',
+ in_channels=2048,
+ hid_channels=8192,
+ out_channels=8192,
+ num_layers=3,
+ with_last_bn=False,
+ with_last_bn_affine=False,
+ with_avg_pool=True,
+ init_cfg=dict(
+ type='Kaiming', distribution='uniform', layer=['Linear'])),
+ head=dict(
+ type='LatentCrossCorrelationHead',
+ in_channels=8192,
+ loss=dict(type='CrossCorrelationLoss')))
+
+# optimizer
+optim_wrapper = dict(
+ type='OptimWrapper',
+ optimizer=dict(type='LARS', lr=1.6, momentum=0.9, weight_decay=1e-6),
+ paramwise_cfg=dict(
+ custom_keys={
+ 'bn': dict(decay_mult=0, lr_mult=0.024, lars_exclude=True),
+ 'bias': dict(decay_mult=0, lr_mult=0.024, lars_exclude=True),
+ # bn layer in ResNet block downsample module
+ 'downsample.1': dict(
+ decay_mult=0, lr_mult=0.024, lars_exclude=True),
+ }))
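+# NOTE: BN parameters, biases and the BN in the downsample branch use a reduced
+# lr (1.6 * 0.024 = 0.0384), no weight decay, and are excluded from the LARS
+# trust-ratio adaptation via `lars_exclude=True`.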
+
+# learning rate scheduler
+param_scheduler = [
+ dict(
+ type='LinearLR',
+ start_factor=1.6e-4,
+ by_epoch=True,
+ begin=0,
+ end=10,
+ convert_to_iter_based=True),
+ dict(
+ type='CosineAnnealingLR',
+ T_max=990,
+ eta_min=0.0016,
+ by_epoch=True,
+ begin=10,
+ end=1000,
+ convert_to_iter_based=True)
+]
+
+# runtime settings
+train_cfg = dict(type='EpochBasedTrainLoop', max_epochs=1000)
+default_hooks = dict(checkpoint=dict(max_keep_ckpts=3))
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR
+# based on the actual training batch size.
+auto_scale_lr = dict(base_batch_size=2048)
diff --git a/configs/barlowtwins/barlowtwins_resnet50_8xb256-coslr-300e_in1k.py b/configs/barlowtwins/barlowtwins_resnet50_8xb256-coslr-300e_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..74a7f2b9bb09a3d2cb0da644935c5f2d181bd5f4
--- /dev/null
+++ b/configs/barlowtwins/barlowtwins_resnet50_8xb256-coslr-300e_in1k.py
@@ -0,0 +1,70 @@
+_base_ = [
+ '../_base_/datasets/imagenet_bs32_byol.py',
+ '../_base_/default_runtime.py',
+]
+# datasets
+train_dataloader = dict(batch_size=256)
+
+# model settings
+model = dict(
+ type='BarlowTwins',
+ backbone=dict(
+ type='ResNet',
+ depth=50,
+ norm_cfg=dict(type='SyncBN'),
+ zero_init_residual=True),
+ neck=dict(
+ type='NonLinearNeck',
+ in_channels=2048,
+ hid_channels=8192,
+ out_channels=8192,
+ num_layers=3,
+ with_last_bn=False,
+ with_last_bn_affine=False,
+ with_avg_pool=True,
+ init_cfg=dict(
+ type='Kaiming', distribution='uniform', layer=['Linear'])),
+ head=dict(
+ type='LatentCrossCorrelationHead',
+ in_channels=8192,
+ loss=dict(type='CrossCorrelationLoss')))
+
+# optimizer
+optim_wrapper = dict(
+ type='OptimWrapper',
+ optimizer=dict(type='LARS', lr=1.6, momentum=0.9, weight_decay=1e-6),
+ paramwise_cfg=dict(
+ custom_keys={
+ 'bn': dict(decay_mult=0, lr_mult=0.024, lars_exclude=True),
+ 'bias': dict(decay_mult=0, lr_mult=0.024, lars_exclude=True),
+ # bn layer in ResNet block downsample module
+ 'downsample.1': dict(
+ decay_mult=0, lr_mult=0.024, lars_exclude=True),
+ }))
+
+# learning rate scheduler
+param_scheduler = [
+ dict(
+ type='LinearLR',
+ start_factor=1.6e-4,
+ by_epoch=True,
+ begin=0,
+ end=10,
+ convert_to_iter_based=True),
+ dict(
+ type='CosineAnnealingLR',
+ T_max=290,
+ eta_min=0.0016,
+ by_epoch=True,
+ begin=10,
+ end=300,
+ convert_to_iter_based=True)
+]
+
+# runtime settings
+train_cfg = dict(type='EpochBasedTrainLoop', max_epochs=300)
+default_hooks = dict(checkpoint=dict(max_keep_ckpts=3))
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR
+# based on the actual training batch size.
+auto_scale_lr = dict(base_batch_size=2048)
diff --git a/configs/barlowtwins/benchmarks/resnet50_8xb32-linear-coslr-100e_in1k.py b/configs/barlowtwins/benchmarks/resnet50_8xb32-linear-coslr-100e_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..2f4e4f574ffd130abff07f9b1e2ec22b80fbbaba
--- /dev/null
+++ b/configs/barlowtwins/benchmarks/resnet50_8xb32-linear-coslr-100e_in1k.py
@@ -0,0 +1,15 @@
+_base_ = [
+ '../../_base_/models/resnet50.py',
+ '../../_base_/datasets/imagenet_bs32_pil_resize.py',
+ '../../_base_/schedules/imagenet_sgd_coslr_100e.py',
+ '../../_base_/default_runtime.py',
+]
+
+model = dict(
+ backbone=dict(
+ frozen_stages=4,
+ init_cfg=dict(type='Pretrained', checkpoint='', prefix='backbone.')))
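+# NOTE: the empty `checkpoint` above is a placeholder; point it to the
+# pretrained BarlowTwins weights before running this linear-probe benchmark,
+# e.g. via `--cfg-options model.backbone.init_cfg.checkpoint=<path>`.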
+
+# runtime settings
+default_hooks = dict(
+ checkpoint=dict(type='CheckpointHook', interval=10, max_keep_ckpts=3))
diff --git a/configs/barlowtwins/metafile.yml b/configs/barlowtwins/metafile.yml
new file mode 100644
index 0000000000000000000000000000000000000000..705080e09af9c59ecc88737073deed6de170664c
--- /dev/null
+++ b/configs/barlowtwins/metafile.yml
@@ -0,0 +1,44 @@
+Collections:
+ - Name: BarlowTwins
+ Metadata:
+ Training Data: ImageNet-1k
+ Training Techniques:
+ - LARS
+ Training Resources: 8x A100 GPUs
+ Architecture:
+ - ResNet
+ - BarlowTwins
+ Paper:
+ Title: 'Barlow Twins: Self-Supervised Learning via Redundancy Reduction'
+ URL: https://arxiv.org/abs/2103.03230
+ README: configs/barlowtwins/README.md
+
+Models:
+ - Name: barlowtwins_resnet50_8xb256-coslr-300e_in1k
+ Metadata:
+ Epochs: 300
+ Batch Size: 2048
+ FLOPs: 4109364224
+ Parameters: 174535744
+ Training Data: ImageNet-1k
+ In Collection: BarlowTwins
+ Results: null
+ Weights: https://download.openmmlab.com/mmselfsup/1.x/barlowtwins/barlowtwins_resnet50_8xb256-coslr-300e_in1k/barlowtwins_resnet50_8xb256-coslr-300e_in1k_20220825-57307488.pth
+ Config: configs/barlowtwins/barlowtwins_resnet50_8xb256-coslr-300e_in1k.py
+ Downstream:
+ - resnet50_barlowtwins-pre_8xb32-linear-coslr-100e_in1k
+ - Name: resnet50_barlowtwins-pre_8xb32-linear-coslr-100e_in1k
+ Metadata:
+ Epochs: 100
+ Batch Size: 256
+ FLOPs: 4109464576
+ Parameters: 25557032
+ Training Data: ImageNet-1k
+ In Collection: BarlowTwins
+ Results:
+ - Task: Image Classification
+ Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 71.8
+ Weights: https://download.openmmlab.com/mmselfsup/1.x/barlowtwins/barlowtwins_resnet50_8xb256-coslr-300e_in1k/resnet50_linear-8xb32-coslr-100e_in1k/resnet50_linear-8xb32-coslr-100e_in1k_20220825-52fde35f.pth
+ Config: configs/barlowtwins/benchmarks/resnet50_8xb32-linear-coslr-100e_in1k.py
diff --git a/configs/beit/README.md b/configs/beit/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..404e6524a4db0e73daffd277386131717bd4106d
--- /dev/null
+++ b/configs/beit/README.md
@@ -0,0 +1,88 @@
+# BEiT
+
+> [BEiT: BERT Pre-Training of Image Transformers](https://arxiv.org/abs/2106.08254)
+
+
+
+## Abstract
+
+We introduce a self-supervised vision representation model BEiT, which stands for Bidirectional Encoder representation from Image Transformers. Following BERT developed in the natural language processing area, we propose a masked image modeling task to pretrain vision Transformers. Specifically, each image has two views in our pre-training, i.e, image patches (such as 16x16 pixels), and visual tokens (i.e., discrete tokens). We first "tokenize" the original image into visual tokens. Then we randomly mask some image patches and fed them into the backbone Transformer. The pre-training objective is to recover the original visual tokens based on the corrupted image patches. After pre-training BEiT, we directly fine-tune the model parameters on downstream tasks by appending task layers upon the pretrained encoder. Experimental results on image classification and semantic segmentation show that our model achieves competitive results with previous pre-training methods. For example, base-size BEiT achieves 83.2% top-1 accuracy on ImageNet-1K, significantly outperforming from-scratch DeiT training (81.8%) with the same setup. Moreover, large-size BEiT obtains 86.3% only using ImageNet-1K, even outperforming ViT-L with supervised pre-training on ImageNet-22K (85.2%).
+
+
+

+
+
+## How to use it?
+
+
+
+**Predict image**
+
+```python
+from mmpretrain import inference_model
+
+predict = inference_model('beit-base-p16_beit-pre_8xb128-coslr-100e_in1k', 'demo/bird.JPEG')
+print(predict['pred_class'])
+print(predict['pred_score'])
+```
+
+**Use the model**
+
+```python
+import torch
+from mmpretrain import get_model
+
+model = get_model('beit_beit-base-p16_8xb256-amp-coslr-300e_in1k', pretrained=True)
+inputs = torch.rand(1, 3, 224, 224)
+out = model(inputs)
+print(type(out))
+# To extract features.
+feats = model.extract_feat(inputs)
+print(type(feats))
+```
+
+**Train/Test Command**
+
+Prepare your dataset according to the [docs](https://mmpretrain.readthedocs.io/en/latest/user_guides/dataset_prepare.html#prepare-dataset).
+
+Train:
+
+```shell
+python tools/train.py configs/beit/beit_beit-base-p16_8xb256-amp-coslr-300e_in1k.py
+```
+
+Test:
+
+```shell
+python tools/test.py configs/beit/benchmarks/beit-base-p16_8xb128-coslr-100e_in1k.py https://download.openmmlab.com/mmselfsup/1.x/beit/beit_vit-base-p16_8xb256-amp-coslr-300e_in1k/vit-base-p16_ft-8xb128-coslr-100e_in1k/vit-base-p16_ft-8xb128-coslr-100e_in1k_20221128-0ca393e9.pth
+```
+
+
+
+## Models and results
+
+### Pretrained models
+
+| Model | Params (M) | Flops (G) | Config | Download |
+| :---------------------------------------------- | :--------: | :-------: | :--------------------------------------------------------: | :--------------------------------------------------------------------------: |
+| `beit_beit-base-p16_8xb256-amp-coslr-300e_in1k` | 86.53 | 17.58 | [config](beit_beit-base-p16_8xb256-amp-coslr-300e_in1k.py) | [model](https://download.openmmlab.com/mmselfsup/1.x/beit/beit_vit-base-p16_8xb256-amp-coslr-300e_in1k/beit_vit-base-p16_8xb256-amp-coslr-300e_in1k_20221128-ab79e626.pth) \| [log](https://download.openmmlab.com/mmselfsup/1.x/beit/beit_vit-base-p16_8xb256-amp-coslr-300e_in1k/beit_vit-base-p16_8xb256-amp-coslr-300e_in1k_20221128-ab79e626.json) |
+
+### Image Classification on ImageNet-1k
+
+| Model | Pretrain | Params (M) | Flops (G) | Top-1 (%) | Top-5 (%) | Config | Download |
+| :-------------------------------------- | :----------------------------------------: | :--------: | :-------: | :-------: | :-------: | :--------------------------------------: | :----------------------------------------: |
+| `beit-base-p16_beit-pre_8xb128-coslr-100e_in1k` | [BEIT](https://download.openmmlab.com/mmselfsup/1.x/beit/beit_vit-base-p16_8xb256-amp-coslr-300e_in1k/beit_vit-base-p16_8xb256-amp-coslr-300e_in1k_20221128-ab79e626.pth) | 86.53 | 17.58 | 83.10 | N/A | [config](benchmarks/beit-base-p16_8xb128-coslr-100e_in1k.py) | [model](https://download.openmmlab.com/mmselfsup/1.x/beit/beit_vit-base-p16_8xb256-amp-coslr-300e_in1k/vit-base-p16_ft-8xb128-coslr-100e_in1k/vit-base-p16_ft-8xb128-coslr-100e_in1k_20221128-0ca393e9.pth) \| [log](https://download.openmmlab.com/mmselfsup/1.x/beit/beit_vit-base-p16_8xb256-amp-coslr-300e_in1k/vit-base-p16_ft-8xb128-coslr-100e_in1k/vit-base-p16_ft-8xb128-coslr-100e_in1k_20221128-0ca393e9.json) |
+| `beit-base-p16_beit-in21k-pre_3rdparty_in1k`\* | BEIT ImageNet-21k | 86.53 | 17.58 | 85.28 | 97.59 | [config](benchmarks/beit-base-p16_8xb64_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/beit/beit-base_3rdparty_in1k_20221114-c0a4df23.pth) |
+
+*Models with * are converted from the [official repo](https://github.com/microsoft/unilm/tree/master/beit). The config files of these models are only for inference. We haven't reproduced the training results.*
+
+## Citation
+
+```bibtex
+@inproceedings{bao2022beit,
+ title={{BE}iT: {BERT} Pre-Training of Image Transformers},
+ author={Hangbo Bao and Li Dong and Songhao Piao and Furu Wei},
+ booktitle={International Conference on Learning Representations},
+ year={2022},
+}
+```
diff --git a/configs/beit/beit_beit-base-p16_8xb256-amp-coslr-300e_in1k.py b/configs/beit/beit_beit-base-p16_8xb256-amp-coslr-300e_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..5786f79ef207f1e54b9ded1903c6b3a7b632b4f3
--- /dev/null
+++ b/configs/beit/beit_beit-base-p16_8xb256-amp-coslr-300e_in1k.py
@@ -0,0 +1,130 @@
+_base_ = '../_base_/default_runtime.py'
+
+# dataset settings
+dataset_type = 'ImageNet'
+data_root = 'data/imagenet/'
+data_preprocessor = dict(
+ type='TwoNormDataPreprocessor',
+ mean=[123.675, 116.28, 103.53],
+ std=[58.395, 57.12, 57.375],
+ second_mean=[-31.875, -31.875, -31.875],
+ second_std=[318.75, 318.75, 318.75],
+ to_rgb=True)
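+# The second mean/std normalize the extra 112x112 view for the DALL-E target
+# generator: (x + 31.875) / 318.75 maps pixel value 0 to 0.1 and 255 to 0.9.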
+
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='ColorJitter',
+ brightness=0.4,
+ contrast=0.4,
+ saturation=0.4,
+ hue=0.),
+ dict(type='RandomFlip', prob=0.5, direction='horizontal'),
+ dict(
+ type='RandomResizedCropAndInterpolationWithTwoPic',
+ size=224,
+ second_size=112,
+ interpolation='bicubic',
+ second_interpolation='lanczos',
+ scale=(0.08, 1.0)),
+ dict(
+ type='BEiTMaskGenerator',
+ input_size=(14, 14),
+ num_masking_patches=75,
+ max_num_patches=None,
+ min_num_patches=16),
+ dict(type='PackInputs')
+]
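+# NOTE: with a 14x14 patch grid (196 patches in total), masking 75 patches
+# corresponds to a masking ratio of roughly 38%.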
+train_dataloader = dict(
+ batch_size=256,
+ num_workers=8,
+ persistent_workers=True,
+ sampler=dict(type='DefaultSampler', shuffle=True),
+ collate_fn=dict(type='default_collate'),
+ dataset=dict(
+ type=dataset_type,
+ data_root=data_root,
+ ann_file='meta/train.txt',
+ data_prefix=dict(img_path='train/'),
+ pipeline=train_pipeline))
+
+# model settings
+model = dict(
+ type='BEiT',
+ backbone=dict(
+ type='BEiTPretrainViT',
+ arch='base',
+ patch_size=16,
+ drop_path_rate=0.1,
+ final_norm=True,
+ out_type='raw',
+ layer_scale_init_value=0.1,
+ init_cfg=[
+ dict(type='TruncNormal', std=0.02, layer='Linear'),
+ dict(type='TruncNormal', std=0.02, layer='Conv2d'),
+ dict(type='Constant', layer='LayerNorm', val=1.0, bias=0.0)
+ ]),
+ neck=None,
+ head=dict(
+ type='BEiTV1Head',
+ embed_dims=768,
+ num_embed=8192,
+ loss=dict(type='CrossEntropyLoss')),
+ target_generator=dict(
+ type='DALL-E',
+ init_cfg=dict(
+ type='Pretrained',
+ checkpoint= # noqa: E251
+ 'https://download.openmmlab.com/mmselfsup/1.x/target_generator_ckpt/dalle_encoder.pth', # noqa: E501
+ )))
+
+# optimizer wrapper
+optim_wrapper = dict(
+ type='AmpOptimWrapper',
+ loss_scale='dynamic',
+ optimizer=dict(
+ type='AdamW', lr=1.5e-3, betas=(0.9, 0.999), weight_decay=0.05),
+ clip_grad=dict(max_norm=3.0),
+ paramwise_cfg=dict(
+ custom_keys={
+ # the following configurations are designed for BEiT
+ '.ln': dict(decay_mult=0.0),
+ '.bias': dict(decay_mult=0.0),
+ 'q_bias': dict(decay_mult=0.0),
+ 'v_bias': dict(decay_mult=0.0),
+ '.cls_token': dict(decay_mult=0.0),
+ '.pos_embed': dict(decay_mult=0.0),
+ '.gamma': dict(decay_mult=0.0),
+ }))
+
+# learning rate scheduler
+param_scheduler = [
+ dict(
+ type='LinearLR',
+ start_factor=1e-4,
+ by_epoch=True,
+ begin=0,
+ end=10,
+ convert_to_iter_based=True),
+ dict(
+ type='CosineAnnealingLR',
+ eta_min=1e-5,
+ by_epoch=True,
+ begin=10,
+ end=300,
+ convert_to_iter_based=True)
+]
+
+# runtime settings
+train_cfg = dict(type='EpochBasedTrainLoop', max_epochs=300)
+default_hooks = dict(
+ # only keeps the latest 3 checkpoints
+ checkpoint=dict(type='CheckpointHook', interval=1, max_keep_ckpts=3))
+
+randomness = dict(seed=0, diff_rank_seed=True)
+
+find_unused_parameters = True
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR
+# based on the actual training batch size.
+auto_scale_lr = dict(base_batch_size=2048)
diff --git a/configs/beit/benchmarks/beit-base-p16_8xb128-coslr-100e_in1k.py b/configs/beit/benchmarks/beit-base-p16_8xb128-coslr-100e_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..dbab34f6e084f5c9959cfb233174a0dc059e0930
--- /dev/null
+++ b/configs/beit/benchmarks/beit-base-p16_8xb128-coslr-100e_in1k.py
@@ -0,0 +1,127 @@
+_base_ = [
+ '../../_base_/datasets/imagenet_bs64_swin_224.py',
+ '../../_base_/schedules/imagenet_bs1024_adamw_swin.py',
+ '../../_base_/default_runtime.py'
+]
+
+data_preprocessor = dict(
+ num_classes=1000,
+ mean=[127.5, 127.5, 127.5],
+ std=[127.5, 127.5, 127.5],
+ to_rgb=True,
+)
+
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='BEiTViT',
+ arch='base',
+ img_size=224,
+ patch_size=16,
+ drop_path_rate=0.1,
+ out_type='avg_featmap',
+ use_abs_pos_emb=False,
+ use_rel_pos_bias=True,
+ use_shared_rel_pos_bias=False,
+ init_cfg=dict(type='Pretrained', checkpoint='', prefix='backbone.')),
+ neck=None,
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=768,
+ loss=dict(
+ type='LabelSmoothLoss', label_smooth_val=0.1, mode='original'),
+ init_cfg=[dict(type='TruncNormal', layer='Linear', std=0.02)]),
+ train_cfg=dict(augments=[
+ dict(type='Mixup', alpha=0.8),
+ dict(type='CutMix', alpha=1.0)
+ ]))
+
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='RandomResizedCrop',
+ scale=224,
+ backend='pillow',
+ interpolation='bicubic'),
+ dict(type='RandomFlip', prob=0.5, direction='horizontal'),
+ dict(
+ type='RandAugment',
+ policies='timm_increasing',
+ num_policies=2,
+ total_level=10,
+ magnitude_level=9,
+ magnitude_std=0.5,
+ hparams=dict(pad_val=[104, 116, 124], interpolation='bicubic')),
+ dict(
+ type='RandomErasing',
+ erase_prob=0.25,
+ mode='rand',
+ min_area_ratio=0.02,
+ max_area_ratio=0.3333333333333333,
+ fill_color=[103.53, 116.28, 123.675],
+ fill_std=[57.375, 57.12, 58.395]),
+ dict(type='PackInputs')
+]
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='ResizeEdge',
+ scale=256,
+ edge='short',
+ backend='pillow',
+ interpolation='bicubic'),
+ dict(type='CenterCrop', crop_size=224),
+ dict(type='PackInputs')
+]
+
+train_dataloader = dict(batch_size=128, dataset=dict(pipeline=train_pipeline))
+val_dataloader = dict(batch_size=128, dataset=dict(pipeline=test_pipeline))
+test_dataloader = val_dataloader
+
+# optimizer wrapper
+optim_wrapper = dict(
+ optimizer=dict(
+ type='AdamW', lr=4e-3, weight_decay=0.05, betas=(0.9, 0.999)),
+ constructor='LearningRateDecayOptimWrapperConstructor',
+ paramwise_cfg=dict(
+ _delete_=True,
+ layer_decay_rate=0.65,
+ custom_keys={
+ # the following configurations are designed for BEiT
+ '.ln': dict(decay_mult=0.0),
+ '.bias': dict(decay_mult=0.0),
+ 'q_bias': dict(decay_mult=0.0),
+ 'v_bias': dict(decay_mult=0.0),
+ '.cls_token': dict(decay_mult=0.0),
+ '.pos_embed': dict(decay_mult=0.0),
+ '.gamma': dict(decay_mult=0.0),
+ }))
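+# NOTE: the layer-wise decay constructor scales each transformer layer's lr by
+# a power of `layer_decay_rate`, so earlier layers are fine-tuned with smaller
+# learning rates than the head.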
+
+# learning rate scheduler
+param_scheduler = [
+ dict(
+ type='LinearLR',
+ start_factor=1e-4,
+ by_epoch=True,
+ begin=0,
+ end=20,
+ convert_to_iter_based=True),
+ dict(
+ type='CosineAnnealingLR',
+ by_epoch=True,
+ begin=20,
+ end=100,
+ eta_min=1e-6,
+ convert_to_iter_based=True)
+]
+
+# runtime settings
+default_hooks = dict(
+ # save checkpoint per epoch.
+ checkpoint=dict(type='CheckpointHook', interval=1, max_keep_ckpts=2))
+
+train_cfg = dict(by_epoch=True, max_epochs=100)
+
+randomness = dict(seed=0)
diff --git a/configs/beit/benchmarks/beit-base-p16_8xb64_in1k.py b/configs/beit/benchmarks/beit-base-p16_8xb64_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..8380b69afc061d1934fae3eba57b7f352a508b1e
--- /dev/null
+++ b/configs/beit/benchmarks/beit-base-p16_8xb64_in1k.py
@@ -0,0 +1,43 @@
+_base_ = [
+ '../../_base_/datasets/imagenet_bs64_swin_224.py',
+ '../../_base_/schedules/imagenet_bs1024_adamw_swin.py',
+ '../../_base_/default_runtime.py'
+]
+
+data_preprocessor = dict(
+ num_classes=1000,
+ # RGB format normalization parameters
+ mean=[127.5, 127.5, 127.5],
+ std=[127.5, 127.5, 127.5],
+ # convert image from BGR to RGB
+ to_rgb=True,
+)
+
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='BEiTViT',
+ arch='base',
+ img_size=224,
+ patch_size=16,
+ out_type='avg_featmap',
+ use_abs_pos_emb=False,
+ use_rel_pos_bias=True,
+ use_shared_rel_pos_bias=False,
+ ),
+ neck=None,
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=768,
+ loss=dict(
+ type='LabelSmoothLoss', label_smooth_val=0.1, mode='original'),
+ ),
+ init_cfg=[
+ dict(type='TruncNormal', layer='Linear', std=.02),
+ dict(type='Constant', layer='LayerNorm', val=1., bias=0.),
+ ],
+ train_cfg=dict(augments=[
+ dict(type='Mixup', alpha=0.8),
+ dict(type='CutMix', alpha=1.0)
+ ]))
diff --git a/configs/beit/metafile.yml b/configs/beit/metafile.yml
new file mode 100644
index 0000000000000000000000000000000000000000..e4524faec783292836fcb2520e9cff5c2262e93d
--- /dev/null
+++ b/configs/beit/metafile.yml
@@ -0,0 +1,69 @@
+Collections:
+ - Name: BEiT
+ Metadata:
+ Architecture:
+ - Attention Dropout
+ - Convolution
+ - Dense Connections
+ - Dropout
+ - GELU
+ - Layer Normalization
+ - Multi-Head Attention
+ - Scaled Dot-Product Attention
+ - Tanh Activation
+ Paper:
+ Title: 'BEiT: BERT Pre-Training of Image Transformers'
+ URL: https://arxiv.org/abs/2106.08254
+ README: configs/beit/README.md
+ Code:
+ URL: https://github.com/open-mmlab/mmpretrain/blob/main/mmpretrain/models/backbones/beit.py
+ Version: v1.0.0rc4
+
+Models:
+ - Name: beit_beit-base-p16_8xb256-amp-coslr-300e_in1k
+ Metadata:
+ Epochs: 300
+ Batch Size: 2048
+ FLOPs: 17581219584
+ Parameters: 86530984
+ Training Data: ImageNet-1k
+ In Collection: BEiT
+ Results: null
+ Weights: https://download.openmmlab.com/mmselfsup/1.x/beit/beit_vit-base-p16_8xb256-amp-coslr-300e_in1k/beit_vit-base-p16_8xb256-amp-coslr-300e_in1k_20221128-ab79e626.pth
+ Config: configs/beit/beit_beit-base-p16_8xb256-amp-coslr-300e_in1k.py
+ Downstream:
+ - beit-base-p16_beit-pre_8xb128-coslr-100e_in1k
+ - Name: beit-base-p16_beit-pre_8xb128-coslr-100e_in1k
+ Metadata:
+ Epochs: 100
+ Batch Size: 1024
+ FLOPs: 17581219584
+ Parameters: 86530984
+ Training Data: ImageNet-1k
+ In Collection: BEiT
+ Results:
+ - Task: Image Classification
+ Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 83.1
+ Weights: https://download.openmmlab.com/mmselfsup/1.x/beit/beit_vit-base-p16_8xb256-amp-coslr-300e_in1k/vit-base-p16_ft-8xb128-coslr-100e_in1k/vit-base-p16_ft-8xb128-coslr-100e_in1k_20221128-0ca393e9.pth
+ Config: configs/beit/benchmarks/beit-base-p16_8xb128-coslr-100e_in1k.py
+ - Name: beit-base-p16_beit-in21k-pre_3rdparty_in1k
+ Metadata:
+ FLOPs: 17581219584
+ Parameters: 86530984
+ Training Data:
+ - ImageNet-21k
+ - ImageNet-1k
+ In Collection: BEiT
+ Results:
+ - Dataset: ImageNet-1k
+ Task: Image Classification
+ Metrics:
+ Top 1 Accuracy: 85.28
+ Top 5 Accuracy: 97.59
+ Weights: https://download.openmmlab.com/mmclassification/v0/beit/beit-base_3rdparty_in1k_20221114-c0a4df23.pth
+ Config: configs/beit/benchmarks/beit-base-p16_8xb64_in1k.py
+ Converted From:
+ Weights: https://conversationhub.blob.core.windows.net/beit-share-public/beit/beit_base_patch16_224_pt22k_ft22kto1k.pth
+ Code: https://github.com/microsoft/unilm/tree/master/beit
diff --git a/configs/beitv2/README.md b/configs/beitv2/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..5447e2d3a36e1d1e0f3d6800c4cc2e2380fdc012
--- /dev/null
+++ b/configs/beitv2/README.md
@@ -0,0 +1,90 @@
+# BEiTv2
+
+> [BEiT v2: Masked Image Modeling with Vector-Quantized Visual Tokenizers](https://arxiv.org/abs/2208.06366)
+
+
+
+## Abstract
+
+Masked image modeling (MIM) has demonstrated impressive results in self-supervised representation learning by recovering corrupted image patches. However, most existing studies operate on low-level image pixels, which hinders the exploitation of high-level semantics for representation models. In this work, we propose to use a semantic-rich visual tokenizer as the reconstruction target for masked prediction, providing a systematic way to promote MIM from pixel-level to semantic-level. Specifically, we propose vector-quantized knowledge distillation to train the tokenizer, which discretizes a continuous semantic space to compact codes. We then pretrain vision Transformers by predicting the original visual tokens for the masked image patches. Furthermore, we introduce a patch aggregation strategy which associates discrete image patches to enhance global semantic representation. Experiments on image classification and semantic segmentation show that BEiT v2 outperforms all compared MIM methods. On ImageNet-1K (224 size), the base-size BEiT v2 achieves 85.5% top-1 accuracy for fine-tuning and 80.1% top-1 accuracy for linear probing. The large-size BEiT v2 obtains 87.3% top-1 accuracy for ImageNet-1K (224 size) fine-tuning, and 56.7% mIoU on ADE20K for semantic segmentation.
+
+
+

+
+
+## How to use it?
+
+
+
+**Predict image**
+
+```python
+from mmpretrain import inference_model
+
+predict = inference_model('beit-base-p16_beitv2-pre_8xb128-coslr-100e_in1k', 'demo/bird.JPEG')
+print(predict['pred_class'])
+print(predict['pred_score'])
+```
+
+**Use the model**
+
+```python
+import torch
+from mmpretrain import get_model
+
+model = get_model('beitv2_beit-base-p16_8xb256-amp-coslr-300e_in1k', pretrained=True)
+inputs = torch.rand(1, 3, 224, 224)
+out = model(inputs)
+print(type(out))
+# To extract features.
+feats = model.extract_feat(inputs)
+print(type(feats))
+```
+
+**Train/Test Command**
+
+Prepare your dataset according to the [docs](https://mmpretrain.readthedocs.io/en/latest/user_guides/dataset_prepare.html#prepare-dataset).
+
+Train:
+
+```shell
+python tools/train.py configs/beitv2/beitv2_beit-base-p16_8xb256-amp-coslr-300e_in1k.py
+```
+
+Test:
+
+```shell
+python tools/test.py configs/beitv2/benchmarks/beit-base-p16_8xb128-coslr-100e_in1k.py https://download.openmmlab.com/mmselfsup/1.x/beitv2/beitv2_vit-base-p16_8xb256-amp-coslr-300e_in1k/vit-base-p16_ft-8xb128-coslr-100e_in1k/vit-base-p16_ft-8xb128-coslr-100e_in1k_20221212-d1c0789e.pth
+```
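+
+Multi-GPU testing usually goes through the standard OpenMMLab launcher. The sketch below assumes the usual `tools/dist_test.sh` wrapper and 8 GPUs; adjust the script path and GPU count to your setup.
+
+```shell
+bash tools/dist_test.sh \
+    configs/beitv2/benchmarks/beit-base-p16_8xb128-coslr-100e_in1k.py \
+    https://download.openmmlab.com/mmselfsup/1.x/beitv2/beitv2_vit-base-p16_8xb256-amp-coslr-300e_in1k/vit-base-p16_ft-8xb128-coslr-100e_in1k/vit-base-p16_ft-8xb128-coslr-100e_in1k_20221212-d1c0789e.pth \
+    8
+```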
+
+
+
+## Models and results
+
+### Pretrained models
+
+| Model | Params (M) | Flops (G) | Config | Download |
+| :------------------------------------------------ | :--------: | :-------: | :----------------------------------------------------------: | :----------------------------------------------------------------------: |
+| `beitv2_beit-base-p16_8xb256-amp-coslr-300e_in1k` | 192.81 | 17.58 | [config](beitv2_beit-base-p16_8xb256-amp-coslr-300e_in1k.py) | [model](https://download.openmmlab.com/mmselfsup/1.x/beitv2/beitv2_vit-base-p16_8xb256-amp-coslr-300e_in1k/beitv2_vit-base-p16_8xb256-amp-coslr-300e_in1k_20221212-a157be30.pth) \| [log](https://download.openmmlab.com/mmselfsup/1.x/beitv2/beitv2_vit-base-p16_8xb256-amp-coslr-300e_in1k/beitv2_vit-base-p16_8xb256-amp-coslr-300e_in1k_20221212-a157be30.json) |
+
+### Image Classification on ImageNet-1k
+
+| Model | Pretrain | Params (M) | Flops (G) | Top-1 (%) | Top-5 (%) | Config | Download |
+| :-------------------------------------- | :----------------------------------------: | :--------: | :-------: | :-------: | :-------: | :--------------------------------------: | :----------------------------------------: |
+| `beit-base-p16_beitv2-pre_8xb128-coslr-100e_in1k` | [BEITV2](https://download.openmmlab.com/mmselfsup/1.x/beitv2/beitv2_vit-base-p16_8xb256-amp-coslr-300e_in1k/beitv2_vit-base-p16_8xb256-amp-coslr-300e_in1k_20221212-a157be30.pth) | 86.53 | 17.58 | 85.00 | N/A | [config](benchmarks/beit-base-p16_8xb128-coslr-100e_in1k.py) | [model](https://download.openmmlab.com/mmselfsup/1.x/beitv2/beitv2_vit-base-p16_8xb256-amp-coslr-300e_in1k/vit-base-p16_ft-8xb128-coslr-100e_in1k/vit-base-p16_ft-8xb128-coslr-100e_in1k_20221212-d1c0789e.pth) \| [log](https://download.openmmlab.com/mmselfsup/1.x/beitv2/beitv2_vit-base-p16_8xb256-amp-coslr-300e_in1k/vit-base-p16_ft-8xb128-coslr-100e_in1k/vit-base-p16_ft-8xb128-coslr-100e_in1k_20221212-d1c0789e.json) |
+| `beit-base-p16_beitv2-in21k-pre_3rdparty_in1k`\* | BEITV2 ImageNet-21k | 86.53 | 17.58 | 86.47 | 97.99 | [config](benchmarks/beit-base-p16_8xb64_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/beit/beitv2-base_3rdparty_in1k_20221114-73e11905.pth) |
+
+*Models with * are converted from the [official repo](https://github.com/microsoft/unilm/tree/master/beit2). The config files of these models are only for inference. We haven't reproduced the training results.*
+
+## Citation
+
+```bibtex
+@article{beitv2,
+ title={{BEiT v2}: Masked Image Modeling with Vector-Quantized Visual Tokenizers},
+ author={Zhiliang Peng and Li Dong and Hangbo Bao and Qixiang Ye and Furu Wei},
+ year={2022},
+ eprint={2208.06366},
+ archivePrefix={arXiv},
+ primaryClass={cs.CV}
+}
+```
diff --git a/configs/beitv2/beitv2_beit-base-p16_8xb256-amp-coslr-1600e_in1k.py b/configs/beitv2/beitv2_beit-base-p16_8xb256-amp-coslr-1600e_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..c4a2070b5de3ebbe93ed0b0658ee9157a6b62136
--- /dev/null
+++ b/configs/beitv2/beitv2_beit-base-p16_8xb256-amp-coslr-1600e_in1k.py
@@ -0,0 +1,119 @@
+_base_ = [
+ '../_base_/datasets/imagenet_bs256_beitv2.py',
+ '../_base_/default_runtime.py',
+]
+
+# model settings
+vqkd_encoder = dict(
+ arch='base',
+ img_size=224,
+ patch_size=16,
+ in_channels=3,
+ out_indices=-1,
+ drop_rate=0.,
+ drop_path_rate=0.,
+ norm_cfg=dict(type='LN', eps=1e-6),
+ final_norm=True,
+ out_type='featmap',
+ with_cls_token=True,
+ frozen_stages=-1,
+ use_abs_pos_emb=True,
+ use_rel_pos_bias=False,
+ use_shared_rel_pos_bias=False,
+ layer_scale_init_value=0.,
+ interpolate_mode='bicubic',
+ patch_cfg=dict(),
+ layer_cfgs=dict(),
+ init_cfg=None)
+
+layer_scale_init_value = 0.1
+drop_path_rate = 0.1 # 0. for 300 epochs and 0.1 for 1600 epochs.
+model = dict(
+ type='BEiT',
+ backbone=dict(
+ type='BEiTPretrainViT',
+ arch='base',
+ patch_size=16,
+ out_indices=[-4, -1],
+ drop_path_rate=drop_path_rate,
+ final_norm=False,
+ out_type='raw',
+ layer_scale_init_value=layer_scale_init_value,
+ init_cfg=[
+ dict(type='TruncNormal', std=0.02, layer='Linear'),
+ dict(type='TruncNormal', std=0.02, layer='Conv2d'),
+ dict(type='Constant', layer='LayerNorm', val=1.0, bias=0.0)
+ ]),
+ neck=dict(
+ type='BEiTV2Neck',
+ num_layers=2,
+ early_layers=9,
+ backbone_arch='base',
+ drop_path_rate=drop_path_rate,
+ layer_scale_init_value=layer_scale_init_value,
+ ),
+ head=dict(
+ type='BEiTV2Head',
+ embed_dims=768,
+ num_embed=8192,
+ loss=dict(type='CrossEntropyLoss')),
+ target_generator=dict(
+ type='VQKD',
+ encoder_config=vqkd_encoder,
+ init_cfg=dict(
+ type='Pretrained',
+ checkpoint= # noqa
+ 'https://download.openmmlab.com/mmselfsup/1.x/target_generator_ckpt/vqkd_encoder.pth' # noqa
+ )))
+
+# optimizer wrapper
+optim_wrapper = dict(
+ type='AmpOptimWrapper',
+ loss_scale='dynamic',
+ # betas: (0.9, 0.98) for 300 epochs and (0.9, 0.999) for 1600 epochs.
+ optimizer=dict(
+ type='AdamW', lr=1.5e-3, betas=(0.9, 0.999), weight_decay=0.05),
+ clip_grad=dict(max_norm=3.0),
+ paramwise_cfg=dict(
+ custom_keys={
+ # the following configurations are designed for BEiT
+ '.ln': dict(decay_mult=0.0),
+ '.bias': dict(decay_mult=0.0),
+ 'q_bias': dict(decay_mult=0.0),
+ 'v_bias': dict(decay_mult=0.0),
+ '.cls_token': dict(decay_mult=0.0),
+ '.pos_embed': dict(decay_mult=0.0),
+ '.gamma': dict(decay_mult=0.0),
+ }))
+
+# learning rate scheduler
+param_scheduler = [
+ dict(
+ type='LinearLR',
+ start_factor=1e-4,
+ by_epoch=True,
+ begin=0,
+ end=10,
+ convert_to_iter_based=True),
+ dict(
+ type='CosineAnnealingLR',
+ eta_min=1e-5,
+ by_epoch=True,
+ begin=10,
+ end=1600,
+ convert_to_iter_based=True)
+]
+
+# runtime settings
+train_cfg = dict(type='EpochBasedTrainLoop', max_epochs=1600)
+default_hooks = dict(
+ # only keeps the latest 3 checkpoints
+ checkpoint=dict(type='CheckpointHook', interval=1, max_keep_ckpts=3))
+
+randomness = dict(seed=0, diff_rank_seed=True)
+
+find_unused_parameters = True
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR
+# based on the actual training batch size.
+auto_scale_lr = dict(base_batch_size=2048)
diff --git a/configs/beitv2/beitv2_beit-base-p16_8xb256-amp-coslr-300e_in1k.py b/configs/beitv2/beitv2_beit-base-p16_8xb256-amp-coslr-300e_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..fddeccff1998fa850097ca4ae07b6fe874476dd0
--- /dev/null
+++ b/configs/beitv2/beitv2_beit-base-p16_8xb256-amp-coslr-300e_in1k.py
@@ -0,0 +1,119 @@
+_base_ = [
+ '../_base_/datasets/imagenet_bs256_beitv2.py',
+ '../_base_/default_runtime.py',
+]
+
+# model settings
+vqkd_encoder = dict(
+ arch='base',
+ img_size=224,
+ patch_size=16,
+ in_channels=3,
+ out_indices=-1,
+ drop_rate=0.,
+ drop_path_rate=0.,
+ norm_cfg=dict(type='LN', eps=1e-6),
+ final_norm=True,
+ out_type='featmap',
+ with_cls_token=True,
+ frozen_stages=-1,
+ use_abs_pos_emb=True,
+ use_rel_pos_bias=False,
+ use_shared_rel_pos_bias=False,
+ layer_scale_init_value=0.,
+ interpolate_mode='bicubic',
+ patch_cfg=dict(),
+ layer_cfgs=dict(),
+ init_cfg=None)
+
+layer_scale_init_value = 0.1
+drop_path_rate = 0. # 0. for 300 epochs and 0.1 for 1600 epochs.
+model = dict(
+ type='BEiT',
+ backbone=dict(
+ type='BEiTPretrainViT',
+ arch='base',
+ patch_size=16,
+ out_indices=[-4, -1],
+ drop_path_rate=drop_path_rate,
+ final_norm=False,
+ out_type='raw',
+ layer_scale_init_value=layer_scale_init_value,
+ init_cfg=[
+ dict(type='TruncNormal', std=0.02, layer='Linear'),
+ dict(type='TruncNormal', std=0.02, layer='Conv2d'),
+ dict(type='Constant', layer='LayerNorm', val=1.0, bias=0.0)
+ ]),
+ neck=dict(
+ type='BEiTV2Neck',
+ num_layers=2,
+ early_layers=9,
+ backbone_arch='base',
+ drop_path_rate=drop_path_rate,
+ layer_scale_init_value=layer_scale_init_value,
+ ),
+ head=dict(
+ type='BEiTV2Head',
+ embed_dims=768,
+ num_embed=8192,
+ loss=dict(type='CrossEntropyLoss')),
+ target_generator=dict(
+ type='VQKD',
+ encoder_config=vqkd_encoder,
+ init_cfg=dict(
+ type='Pretrained',
+ checkpoint= # noqa
+ 'https://download.openmmlab.com/mmselfsup/1.x/target_generator_ckpt/vqkd_encoder.pth' # noqa
+ )))
+
+# optimizer wrapper
+optim_wrapper = dict(
+ type='AmpOptimWrapper',
+ loss_scale='dynamic',
+ # betas: (0.9, 0.98) for 300 epochs and (0.9, 0.999) for 1600 epochs.
+ optimizer=dict(
+ type='AdamW', lr=1.5e-3, betas=(0.9, 0.98), weight_decay=0.05),
+ clip_grad=dict(max_norm=3.0),
+ paramwise_cfg=dict(
+ custom_keys={
+ # the following configurations are designed for BEiT
+ '.ln': dict(decay_mult=0.0),
+ '.bias': dict(decay_mult=0.0),
+ 'q_bias': dict(decay_mult=0.0),
+ 'v_bias': dict(decay_mult=0.0),
+ '.cls_token': dict(decay_mult=0.0),
+ '.pos_embed': dict(decay_mult=0.0),
+ '.gamma': dict(decay_mult=0.0),
+ }))
+
+# learning rate scheduler
+param_scheduler = [
+ dict(
+ type='LinearLR',
+ start_factor=1e-4,
+ by_epoch=True,
+ begin=0,
+ end=10,
+ convert_to_iter_based=True),
+ dict(
+ type='CosineAnnealingLR',
+ eta_min=1e-5,
+ by_epoch=True,
+ begin=10,
+ end=300,
+ convert_to_iter_based=True)
+]
+
+# runtime settings
+train_cfg = dict(type='EpochBasedTrainLoop', max_epochs=300)
+default_hooks = dict(
+ # only keeps the latest 3 checkpoints
+ checkpoint=dict(type='CheckpointHook', interval=1, max_keep_ckpts=3))
+
+randomness = dict(seed=0, diff_rank_seed=True)
+
+find_unused_parameters = True
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR
+# based on the actual training batch size.
+auto_scale_lr = dict(base_batch_size=2048)
diff --git a/configs/beitv2/benchmarks/beit-base-p16_8xb128-coslr-100e_in1k.py b/configs/beitv2/benchmarks/beit-base-p16_8xb128-coslr-100e_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..a2c55a706b351d5c8bd7981aaa324877cb440b11
--- /dev/null
+++ b/configs/beitv2/benchmarks/beit-base-p16_8xb128-coslr-100e_in1k.py
@@ -0,0 +1,122 @@
+_base_ = [
+ '../../_base_/datasets/imagenet_bs64_swin_224.py',
+ '../../_base_/schedules/imagenet_bs1024_adamw_swin.py',
+ '../../_base_/default_runtime.py'
+]
+
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='BEiTViT',
+ arch='base',
+ img_size=224,
+ patch_size=16,
+ # drop path rate: 0.2 for models pre-trained for 1600 epochs, 0.1 for 300 epochs.
+ drop_path_rate=0.1,
+ out_type='avg_featmap',
+ use_abs_pos_emb=False,
+ use_rel_pos_bias=True,
+ use_shared_rel_pos_bias=False,
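+ # The pretrained checkpoint path is left empty here; fill it in or override
+ # it at launch time, e.g. with
+ # `--cfg-options model.backbone.init_cfg.checkpoint=CKPT_PATH` (assuming the
+ # standard MMEngine-style train/test entry points).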
+ init_cfg=dict(type='Pretrained', checkpoint='', prefix='backbone.')),
+ neck=None,
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=768,
+ loss=dict(
+ type='LabelSmoothLoss', label_smooth_val=0.1, mode='original'),
+ init_cfg=[dict(type='TruncNormal', layer='Linear', std=0.02)]),
+ train_cfg=dict(augments=[
+ dict(type='Mixup', alpha=0.8),
+ dict(type='CutMix', alpha=1.0)
+ ]))
+
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='RandomResizedCrop',
+ scale=224,
+ backend='pillow',
+ interpolation='bicubic'),
+ dict(type='RandomFlip', prob=0.5, direction='horizontal'),
+ dict(
+ type='RandAugment',
+ policies='timm_increasing',
+ num_policies=2,
+ total_level=10,
+ magnitude_level=9,
+ magnitude_std=0.5,
+ hparams=dict(pad_val=[104, 116, 124], interpolation='bicubic')),
+ dict(
+ type='RandomErasing',
+ erase_prob=0.25,
+ mode='rand',
+ min_area_ratio=0.02,
+ max_area_ratio=0.3333333333333333,
+ fill_color=[103.53, 116.28, 123.675],
+ fill_std=[57.375, 57.12, 58.395]),
+ dict(type='PackInputs')
+]
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='ResizeEdge',
+ scale=256,
+ edge='short',
+ backend='pillow',
+ interpolation='bicubic'),
+ dict(type='CenterCrop', crop_size=224),
+ dict(type='PackInputs')
+]
+
+train_dataloader = dict(batch_size=128, dataset=dict(pipeline=train_pipeline))
+val_dataloader = dict(batch_size=128, dataset=dict(pipeline=test_pipeline))
+test_dataloader = val_dataloader
+
+# optimizer wrapper
+optim_wrapper = dict(
+ optimizer=dict(
+ type='AdamW', lr=5e-4, weight_decay=0.05, betas=(0.9, 0.999)),
+ constructor='LearningRateDecayOptimWrapperConstructor',
+ paramwise_cfg=dict(
+ _delete_=True,
+ # layer-wise decay rate: 0.6 for models pre-trained for 1600 epochs, 0.65 for 300 epochs.
+ layer_decay_rate=0.65,
+ custom_keys={
+ # the following configurations are designed for BEiT
+ '.ln': dict(decay_mult=0.0),
+ '.bias': dict(decay_mult=0.0),
+ 'q_bias': dict(decay_mult=0.0),
+ 'v_bias': dict(decay_mult=0.0),
+ '.cls_token': dict(decay_mult=0.0),
+ '.pos_embed': dict(decay_mult=0.0),
+ '.gamma': dict(decay_mult=0.0),
+ }))
+
+# learning rate scheduler
+param_scheduler = [
+ dict(
+ type='LinearLR',
+ start_factor=1e-4,
+ by_epoch=True,
+ begin=0,
+ end=20,
+ convert_to_iter_based=True),
+ dict(
+ type='CosineAnnealingLR',
+ by_epoch=True,
+ begin=20,
+ end=100,
+ eta_min=1e-6,
+ convert_to_iter_based=True)
+]
+
+# runtime settings
+default_hooks = dict(
+ # save checkpoint per epoch.
+ checkpoint=dict(type='CheckpointHook', interval=1, max_keep_ckpts=2))
+
+train_cfg = dict(by_epoch=True, max_epochs=100)
+
+randomness = dict(seed=0)
diff --git a/configs/beitv2/benchmarks/beit-base-p16_8xb64_in1k.py b/configs/beitv2/benchmarks/beit-base-p16_8xb64_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..17ed4ff3d2cf40f8d819add1b3aa4f668a41128a
--- /dev/null
+++ b/configs/beitv2/benchmarks/beit-base-p16_8xb64_in1k.py
@@ -0,0 +1,34 @@
+_base_ = [
+ '../../_base_/datasets/imagenet_bs64_swin_224.py',
+ '../../_base_/schedules/imagenet_bs1024_adamw_swin.py',
+ '../../_base_/default_runtime.py'
+]
+
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='BEiTViT',
+ arch='base',
+ img_size=224,
+ patch_size=16,
+ out_type='avg_featmap',
+ use_abs_pos_emb=False,
+ use_rel_pos_bias=True,
+ use_shared_rel_pos_bias=False,
+ ),
+ neck=None,
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=768,
+ loss=dict(
+ type='LabelSmoothLoss', label_smooth_val=0.1, mode='original'),
+ ),
+ init_cfg=[
+ dict(type='TruncNormal', layer='Linear', std=.02),
+ dict(type='Constant', layer='LayerNorm', val=1., bias=0.),
+ ],
+ train_cfg=dict(augments=[
+ dict(type='Mixup', alpha=0.8),
+ dict(type='CutMix', alpha=1.0)
+ ]))
diff --git a/configs/beitv2/metafile.yml b/configs/beitv2/metafile.yml
new file mode 100644
index 0000000000000000000000000000000000000000..74c3885e11cd8140cea7aac40973ade4ce4e7e64
--- /dev/null
+++ b/configs/beitv2/metafile.yml
@@ -0,0 +1,69 @@
+Collections:
+ - Name: BEiTv2
+ Metadata:
+ Architecture:
+ - Attention Dropout
+ - Convolution
+ - Dense Connections
+ - Dropout
+ - GELU
+ - Layer Normalization
+ - Multi-Head Attention
+ - Scaled Dot-Product Attention
+ - Tanh Activation
+ Paper:
+ Title: 'BEiT v2: Masked Image Modeling with Vector-Quantized Visual Tokenizers'
+ URL: https://arxiv.org/abs/2208.06366
+ README: configs/beitv2/README.md
+ Code:
+ URL: https://github.com/open-mmlab/mmpretrain/blob/main/mmpretrain/models/backbones/beit.py
+ Version: v1.0.0rc4
+
+Models:
+ - Name: beitv2_beit-base-p16_8xb256-amp-coslr-300e_in1k
+ Metadata:
+ Epochs: 300
+ Batch Size: 2048
+ FLOPs: 17581223424
+ Parameters: 192811376
+ Training Data: ImageNet-1k
+ In Collection: BEiTv2
+ Results: null
+ Weights: https://download.openmmlab.com/mmselfsup/1.x/beitv2/beitv2_vit-base-p16_8xb256-amp-coslr-300e_in1k/beitv2_vit-base-p16_8xb256-amp-coslr-300e_in1k_20221212-a157be30.pth
+ Config: configs/beitv2/beitv2_beit-base-p16_8xb256-amp-coslr-300e_in1k.py
+ Downstream:
+ - beit-base-p16_beitv2-pre_8xb128-coslr-100e_in1k
+ - Name: beit-base-p16_beitv2-pre_8xb128-coslr-100e_in1k
+ Metadata:
+ Epochs: 100
+ Batch Size: 1024
+ FLOPs: 17581219584
+ Parameters: 86530984
+ Training Data: ImageNet-1k
+ In Collection: BEiTv2
+ Results:
+ - Task: Image Classification
+ Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 85.0
+ Weights: https://download.openmmlab.com/mmselfsup/1.x/beitv2/beitv2_vit-base-p16_8xb256-amp-coslr-300e_in1k/vit-base-p16_ft-8xb128-coslr-100e_in1k/vit-base-p16_ft-8xb128-coslr-100e_in1k_20221212-d1c0789e.pth
+ Config: configs/beitv2/benchmarks/beit-base-p16_8xb128-coslr-100e_in1k.py
+ - Name: beit-base-p16_beitv2-in21k-pre_3rdparty_in1k
+ Metadata:
+ FLOPs: 17581219584
+ Parameters: 86530984
+ Training Data:
+ - ImageNet-21k
+ - ImageNet-1k
+ In Collection: BEiTv2
+ Results:
+ - Dataset: ImageNet-1k
+ Task: Image Classification
+ Metrics:
+ Top 1 Accuracy: 86.47
+ Top 5 Accuracy: 97.99
+ Weights: https://download.openmmlab.com/mmclassification/v0/beit/beitv2-base_3rdparty_in1k_20221114-73e11905.pth
+ Config: configs/beitv2/benchmarks/beit-base-p16_8xb64_in1k.py
+ Converted From:
+ Weights: https://conversationhub.blob.core.windows.net/beit-share-public/beitv2/beitv2_base_patch16_224_pt1k_ft21kto1k.pth
+ Code: https://github.com/microsoft/unilm/tree/master/beit2
diff --git a/configs/blip/README.md b/configs/blip/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..1a8dce392cb3ec3ab36eed8ab9b3af90ee0f1219
--- /dev/null
+++ b/configs/blip/README.md
@@ -0,0 +1,128 @@
+# BLIP
+
+> [BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation](https://arxiv.org/abs/2201.12086)
+
+
+
+## Abstract
+
+Vision-Language Pre-training (VLP) has advanced the performance for many vision-language tasks. However, most existing pre-trained models only excel in either understanding-based tasks or generation-based tasks. Furthermore, performance improvement has been largely achieved by scaling up the dataset with noisy image-text pairs collected from the web, which is a suboptimal source of supervision. In this paper, we propose BLIP, a new VLP framework which transfers flexibly to both vision-language understanding and generation tasks. BLIP effectively utilizes the noisy web data by bootstrapping the captions, where a captioner generates synthetic captions and a filter removes the noisy ones. We achieve state-of-the-art results on a wide range of vision-language tasks, such as image-text retrieval (+2.7% in average recall@1), image captioning (+2.8% in CIDEr), and VQA (+1.6% in VQA score). BLIP also demonstrates strong generalization ability when directly transferred to video-language tasks in a zero-shot manner.
+
+
+

+
+
+## How to use it?
+
+
+
+**Use the model**
+
+```python
+from mmpretrain import inference_model
+
+result = inference_model('blip-base_3rdparty_caption', 'demo/cat-dog.png')
+print(result)
+# {'pred_caption': 'a puppy and a cat sitting on a blanket'}
+```
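+
+The checkpoints listed in the tables below cover several other tasks as well. As a rough sketch (the exact argument handling is delegated to the task-specific inferencer in `mmpretrain`, so treat the call signature as an assumption), visual question answering can be queried by passing the question together with the image:
+
+```python
+from mmpretrain import inference_model
+
+# Hypothetical sketch: the question is passed as an extra positional argument
+# and the result is expected to contain a `pred_answer` field.
+result = inference_model('blip-base_3rdparty_vqa', 'demo/cat-dog.png', 'What animals are in the picture?')
+print(result)
+```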
+
+**Test Command**
+
+Prepare your dataset according to the [docs](https://mmpretrain.readthedocs.io/en/latest/user_guides/dataset_prepare.html#prepare-dataset).
+
+Test:
+
+```shell
+python tools/test.py configs/blip/blip-base_8xb32_caption.py https://download.openmmlab.com/mmclassification/v1/blip/blip-base_3rdparty_coco-caption_20230419-a5b71af3.pth
+```
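+
+The same pattern applies to the other configs listed below. For example, evaluating the converted VQA checkpoint would look like the command below, assuming the datasets required by `coco_vg_vqa.py` have already been prepared:
+
+```shell
+python tools/test.py configs/blip/blip-base_8xb32_vqa.py https://download.openmmlab.com/mmclassification/v1/blip/blip-base_3rdparty-capflit_vqa_20230505-81488941.pth
+```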
+
+
+
+## Models and results
+
+### Image Caption on COCO
+
+| Model | Params (M) | BLEU-4 | CIDER | Config | Download |
+| :----------------------------- | :--------: | :----: | :----: | :------------------------------------: | :------------------------------------------------------------------------------------------------------------: |
+| `blip-base_3rdparty_caption`\* | 223.97 | 40.12 | 132.82 | [config](./blip-base_8xb32_caption.py) | [model](https://download.openmmlab.com/mmclassification/v1/blip/blip-base_3rdparty_coco-caption_20230419-a5b71af3.pth) |
+
+### Image Caption on NoCaps
+
+| Model | Params (M) | SPICE | CIDER | Config | Download |
+| :----------------------------- | :--------: | :---: | :----: | :-----------------------------------: | :--------------------------------------------------------------------------------------------------------------: |
+| `blip-base_3rdparty_caption`\* | 223.97 | 14.69 | 109.12 | [config](./blip-base_8xb32_nocaps.py) | [model](https://download.openmmlab.com/mmclassification/v1/blip/blip-base_3rdparty_coco-caption_20230419-a5b71af3.pth) |
+
+### Image Caption on Flickr30k
+
+| Model | Params (M) | SPICE | CIDER | Config | Download |
+| :----------------------------- | :--------: | :---: | :---: | :----------------------------------------------: | :----------------------------------------------------------------------------------------------------: |
+| `blip-base_3rdparty_caption`\* | 223.97 | 15.58 | 68.89 | [config](./blip-base_8xb32_caption_flickr30k.py) | [model](https://download.openmmlab.com/mmclassification/v1/blip/blip-base_3rdparty_coco-caption_20230419-a5b71af3.pth) |
+
+### Visual Grounding on RefCOCO
+
+| Model | Params (M) | Accuracy (testA) | Accuracy (testB) | Config | Download |
+| :------------------------ | :--------: | :--------------: | :--------------: | :----------------------------------: | :-----------------------------------------------------------------------------------------------: |
+| `blip-base_8xb16_refcoco` | 498.49 | 86.14 | 77.33 | [config](blip-base_8xb16_refcoco.py) | [model](https://download.openmmlab.com/mmclassification/v1/blip/blip-base_8xb16_refcoco_20230508-d2d10f4c.pth) \| [log](https://download.openmmlab.com/mmclassification/v1/blip/blip-base_8xb16_refcoco_20230508-d2d10f4c.json) |
+
+### Visual Question Answering on VQAv2
+
+| Model | Params (M) | Accuracy | Config | Download |
+| :------------------------- | :--------: | :------: | :--------------------------------: | :-------------------------------------------------------------------------------------------------------------------: |
+| `blip-base_3rdparty_vqa`\* | 361.48 | 78.20 | [config](./blip-base_8xb32_vqa.py) | [model](https://download.openmmlab.com/mmclassification/v1/blip/blip-base_3rdparty-capflit_vqa_20230505-81488941.pth) |
+
+### Visual Question Answering on OK-VQA
+
+| Model | Params (M) | Accuracy | Config | Download |
+| :------------------------- | :--------: | :------: | :----------------------------------: | :-------------------------------------------------------------------------------------------------------------------: |
+| `blip-base_3rdparty_vqa`\* | 361.48 | 40.59# | [config](./blip-base_8xb32_okvqa.py) | [model](https://download.openmmlab.com/mmclassification/v1/blip/blip-base_3rdparty-capflit_vqa_20230505-81488941.pth) |
+
+### Visual Question Answering on OCR-VQA
+
+| Model | Params (M) | Accuracy | Config | Download |
+| :------------------------- | :--------: | :------: | :-----------------------------------: | :-------------------------------------------------------------------------------------------------------------------: |
+| `blip-base_3rdparty_vqa`\* | 361.48 | 28.30# | [config](./blip-base_8xb32_ocrvqa.py) | [model](https://download.openmmlab.com/mmclassification/v1/blip/blip-base_3rdparty-capflit_vqa_20230505-81488941.pth) |
+
+### Image-To-Text Retrieval on COCO
+
+| Model | Params (M) | Recall@1 | Recall@5 | Config | Download |
+| :------------------------------- | :--------: | :------: | :------: | :--------------------------------------: | :----------------------------------------------------------------------------------------------------: |
+| `blip-base_3rdparty_retrieval`\* | 447.49 | 82.52 | 95.34 | [config](./blip-base_8xb32_retrieval.py) | [model](https://download.openmmlab.com/mmclassification/v1/blip/blip-base_3rdparty_coco-retrieval_20230419-a1804d2c.pth) |
+
+### Text-To-Image Retrieval on COCO
+
+| Model | Params (M) | Recall@1 | Recall@5 | Config | Download |
+| :------------------------------- | :--------: | :------: | :------: | :--------------------------------------: | :----------------------------------------------------------------------------------------------------: |
+| `blip-base_3rdparty_retrieval`\* | 447.49 | 64.82 | 86.28 | [config](./blip-base_8xb32_retrieval.py) | [model](https://download.openmmlab.com/mmclassification/v1/blip/blip-base_3rdparty_coco-retrieval_20230419-a1804d2c.pth) |
+
+### Image-To-Text Retrieval on Flickr30k
+
+| Model | Params (M) | Recall@1 | Recall@5 | Config | Download |
+| :------------------------------- | :--------: | :------: | :------: | :------------------------------------------------: | :------------------------------------------------------------------------------------------: |
+| `blip-base_3rdparty_retrieval`\* | 447.49 | 95.10# | 99.60# | [config](./blip-base_8xb32_retrieval_flickr30k.py) | [model](https://download.openmmlab.com/mmclassification/v1/blip/blip-base_3rdparty_coco-retrieval_20230419-a1804d2c.pth) |
+
+### Text-To-Image Retrieval on Flickr30k
+
+| Model | Params (M) | Recall@1 | Recall@5 | Config | Download |
+| :------------------------------- | :--------: | :------: | :------: | :------------------------------------------------: | :------------------------------------------------------------------------------------------: |
+| `blip-base_3rdparty_retrieval`\* | 447.49 | 85.26# | 96.58# | [config](./blip-base_8xb32_retrieval_flickr30k.py) | [model](https://download.openmmlab.com/mmclassification/v1/blip/blip-base_3rdparty_coco-retrieval_20230419-a1804d2c.pth) |
+
+### NLVR on NLVR2
+
+| Model | Params (M) | Top-1 (%) | Config | Download |
+| :-------------------------- | :--------: | :-------: | :---------------------------------: | :------------------------------------------------------------------------------------------------------------: |
+| `blip-base_3rdparty_nlvr`\* | 259.37 | 82.33 | [config](./blip-base_8xb32_nlvr.py) | [model](https://download.openmmlab.com/mmclassification/v1/blip/blip-base_3rdparty_nlvr_20230427-3b14d33f.pth) |
+
+*Models with * are converted from the [official repo](https://github.com/salesforce/LAVIS). The config files of these models are only for inference. We haven't reproduced the training results.*
+
+*Results with # denote zero-shot evaluation. The corresponding model hasn't been finetuned on that dataset.*
+
+## Citation
+
+```bibtex
+@inproceedings{li2022blip,
+ title={BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation},
+ author={Junnan Li and Dongxu Li and Caiming Xiong and Steven Hoi},
+ year={2022},
+ booktitle={ICML},
+}
+```
diff --git a/configs/blip/blip-base_8xb16_refcoco.py b/configs/blip/blip-base_8xb16_refcoco.py
new file mode 100644
index 0000000000000000000000000000000000000000..b4986143a3d6965f7176bbcea445f675cc9a80ec
--- /dev/null
+++ b/configs/blip/blip-base_8xb16_refcoco.py
@@ -0,0 +1,62 @@
+_base_ = [
+ '../_base_/datasets/refcoco.py',
+ '../_base_/default_runtime.py',
+]
+
+med_config = {
+ 'architectures': ['BertModel'],
+ 'attention_probs_dropout_prob': 0.1,
+ 'hidden_act': 'gelu',
+ 'hidden_dropout_prob': 0.1,
+ 'hidden_size': 768,
+ 'initializer_range': 0.02,
+ 'intermediate_size': 3072,
+ 'layer_norm_eps': 1e-12,
+ 'max_position_embeddings': 512,
+ 'model_type': 'bert',
+ 'num_attention_heads': 12,
+ 'num_hidden_layers': 12,
+ 'pad_token_id': 0,
+ 'add_type_embeddings': False,
+ 'vocab_size': 30524,
+ 'encoder_width': 768,
+ 'add_cross_attention': True
+}
+
+model = dict(
+ type='BlipGrounding',
+ visual_encoder=dict(
+ type='VisionTransformer',
+ arch='b',
+ img_size=384,
+ patch_size=16,
+ out_type='raw',
+ ),
+ text_encoder=dict(
+ type='XBertEncoder',
+ med_config=med_config,
+ ),
+ multimodal_encoder=dict(
+ type='XBertEncoder',
+ med_config=med_config,
+ ),
+ tokenizer=dict(type='BlipTokenizer', name_or_path='bert-base-uncased'),
+ head=dict(
+ type='GroundingHead',
+ decoder=dict(
+ type='XBertLMHeadDecoder',
+ med_config=med_config,
+ ),
+ box_l1_loss_coeff=4.0,
+ box_giou_loss_coeff=2.0,
+ ),
+)
+
+# schedule settings
+optimizer = dict(type='AdamW', lr=1.5e-5, weight_decay=0.02)
+optim_wrapper = dict(type='OptimWrapper', optimizer=optimizer)
+param_scheduler = [dict(type='CosineAnnealingLR', by_epoch=True)]
+
+train_cfg = dict(by_epoch=True, max_epochs=120)
+val_cfg = dict()
+test_cfg = dict()
diff --git a/configs/blip/blip-base_8xb32_caption.py b/configs/blip/blip-base_8xb32_caption.py
new file mode 100644
index 0000000000000000000000000000000000000000..1e24e9eababa53b17ac38502ea37eb6a9de40cf5
--- /dev/null
+++ b/configs/blip/blip-base_8xb32_caption.py
@@ -0,0 +1,59 @@
+_base_ = [
+ '../_base_/datasets/coco_caption.py',
+ '../_base_/default_runtime.py',
+]
+
+# model settings
+model = dict(
+ type='BlipCaption',
+ vision_encoder=dict(
+ type='VisionTransformer',
+ arch='b',
+ img_size=384,
+ patch_size=16,
+ out_type='raw',
+ ),
+ tokenizer=dict(type='BlipTokenizer', name_or_path='bert-base-uncased'),
+ decoder_head=dict(
+ type='SeqGenerationHead',
+ decoder=dict(
+ type='XBertLMHeadDecoder',
+ med_config=dict(
+ architectures=['BertModel'],
+ attention_probs_dropout_prob=0.1,
+ hidden_act='gelu',
+ hidden_dropout_prob=0.1,
+ hidden_size=768,
+ initializer_range=0.02,
+ intermediate_size=3072,
+ layer_norm_eps=1e-12,
+ max_position_embeddings=512,
+ model_type='bert',
+ num_attention_heads=12,
+ num_hidden_layers=12,
+ pad_token_id=0,
+ add_type_embeddings=False,
+ vocab_size=30524,
+ encoder_width=768,
+ add_cross_attention=True),
+ ),
+ ),
+ prompt='a picture of ',
+ max_txt_len=20,
+)
+
+# schedule settings
+optim_wrapper = dict(optimizer=dict(type='AdamW', lr=1e-5, weight_decay=0.05))
+
+param_scheduler = [
+ dict(
+ type='CosineAnnealingLR',
+ by_epoch=True,
+ begin=0,
+ end=10,
+ )
+]
+
+train_cfg = dict(max_epochs=10)
+val_cfg = dict()
+test_cfg = dict()
diff --git a/configs/blip/blip-base_8xb32_caption_flickr30k.py b/configs/blip/blip-base_8xb32_caption_flickr30k.py
new file mode 100644
index 0000000000000000000000000000000000000000..9fe6ec561d6b7cd09d2490e8fb50f4f8315a14ba
--- /dev/null
+++ b/configs/blip/blip-base_8xb32_caption_flickr30k.py
@@ -0,0 +1,59 @@
+_base_ = [
+ '../_base_/datasets/flickr30k_caption.py',
+ '../_base_/default_runtime.py',
+]
+
+# model settings
+model = dict(
+ type='BlipCaption',
+ vision_encoder=dict(
+ type='VisionTransformer',
+ arch='b',
+ img_size=384,
+ patch_size=16,
+ out_type='raw',
+ ),
+ tokenizer=dict(type='BlipTokenizer', name_or_path='bert-base-uncased'),
+ decoder_head=dict(
+ type='SeqGenerationHead',
+ decoder=dict(
+ type='XBertLMHeadDecoder',
+ med_config=dict(
+ architectures=['BertModel'],
+ attention_probs_dropout_prob=0.1,
+ hidden_act='gelu',
+ hidden_dropout_prob=0.1,
+ hidden_size=768,
+ initializer_range=0.02,
+ intermediate_size=3072,
+ layer_norm_eps=1e-12,
+ max_position_embeddings=512,
+ model_type='bert',
+ num_attention_heads=12,
+ num_hidden_layers=12,
+ pad_token_id=0,
+ add_type_embeddings=False,
+ vocab_size=30524,
+ encoder_width=768,
+ add_cross_attention=True),
+ ),
+ ),
+ prompt='a picture of ',
+ max_txt_len=20,
+)
+
+# schedule settings
+optim_wrapper = dict(optimizer=dict(type='AdamW', lr=1e-5, weight_decay=0.05))
+
+param_scheduler = [
+ dict(
+ type='CosineAnnealingLR',
+ by_epoch=True,
+ begin=0,
+ end=10,
+ )
+]
+
+train_cfg = dict(max_epochs=10)
+val_cfg = dict()
+test_cfg = dict()
diff --git a/configs/blip/blip-base_8xb32_nlvr.py b/configs/blip/blip-base_8xb32_nlvr.py
new file mode 100644
index 0000000000000000000000000000000000000000..0a6cfe149a07b508830069ba8b8ec4e3ccccc7c0
--- /dev/null
+++ b/configs/blip/blip-base_8xb32_nlvr.py
@@ -0,0 +1,59 @@
+_base_ = [
+ '../_base_/datasets/nlvr2.py',
+ '../_base_/default_runtime.py',
+]
+
+# model settings
+model = dict(
+ type='BlipNLVR',
+ vision_backbone=dict(
+ type='VisionTransformer',
+ arch='b',
+ img_size=384,
+ patch_size=16,
+ out_type='raw',
+ ),
+ tokenizer=dict(type='BlipTokenizer', name_or_path='bert-base-uncased'),
+ multimodal_backbone=dict(
+ type='BertModel',
+ config=dict(
+ architectures=['BertModel'],
+ attention_probs_dropout_prob=0.1,
+ hidden_act='gelu',
+ hidden_dropout_prob=0.1,
+ hidden_size=768,
+ initializer_range=0.02,
+ intermediate_size=3072,
+ layer_norm_eps=1e-12,
+ max_position_embeddings=512,
+ model_type='bert',
+ num_attention_heads=12,
+ num_hidden_layers=12,
+ pad_token_id=0,
+ add_type_embeddings=False,
+ vocab_size=30524,
+ encoder_width=768,
+ add_cross_attention=True,
+ nlvr=True),
+ add_pooling_layer=False),
+)
+
+# optimizer
+optimizer = dict(type='AdamW', lr=2e-5, weight_decay=0.05)
+optim_wrapper = dict(type='OptimWrapper', optimizer=optimizer)
+
+param_scheduler = [
+ dict(
+ type='CosineAnnealingLR',
+ by_epoch=True,
+ begin=0,
+ end=10,
+ )
+]
+
+# runtime settings
+train_cfg = dict(type='EpochBasedTrainLoop', max_epochs=10)
+val_cfg = dict()
+test_cfg = dict()
+
+default_hooks = dict(logger=dict(interval=1))
diff --git a/configs/blip/blip-base_8xb32_nocaps.py b/configs/blip/blip-base_8xb32_nocaps.py
new file mode 100644
index 0000000000000000000000000000000000000000..c47c56aeec9f6b9f36b35d4ea8c078c06df586ab
--- /dev/null
+++ b/configs/blip/blip-base_8xb32_nocaps.py
@@ -0,0 +1,46 @@
+_base_ = [
+ '../_base_/datasets/nocaps.py',
+ '../_base_/default_runtime.py',
+]
+
+# model settings
+model = dict(
+ type='BlipCaption',
+ vision_encoder=dict(
+ type='VisionTransformer',
+ arch='b',
+ img_size=384,
+ patch_size=16,
+ out_type='raw',
+ ),
+ tokenizer=dict(type='BlipTokenizer', name_or_path='bert-base-uncased'),
+ decoder_head=dict(
+ type='SeqGenerationHead',
+ decoder=dict(
+ type='XBertLMHeadDecoder',
+ med_config=dict(
+ architectures=['BertModel'],
+ attention_probs_dropout_prob=0.1,
+ hidden_act='gelu',
+ hidden_dropout_prob=0.1,
+ hidden_size=768,
+ initializer_range=0.02,
+ intermediate_size=3072,
+ layer_norm_eps=1e-12,
+ max_position_embeddings=512,
+ model_type='bert',
+ num_attention_heads=12,
+ num_hidden_layers=12,
+ pad_token_id=0,
+ add_type_embeddings=False,
+ vocab_size=30524,
+ encoder_width=768,
+ add_cross_attention=True),
+ ),
+ ),
+ prompt='a picture of ',
+ max_txt_len=20,
+)
+
+val_cfg = dict()
+test_cfg = dict()
diff --git a/configs/blip/blip-base_8xb32_ocrvqa.py b/configs/blip/blip-base_8xb32_ocrvqa.py
new file mode 100644
index 0000000000000000000000000000000000000000..117d597fcb2d92aab1c0f0bc79aa895a3ab99643
--- /dev/null
+++ b/configs/blip/blip-base_8xb32_ocrvqa.py
@@ -0,0 +1,75 @@
+_base_ = [
+ '../_base_/datasets/ocrvqa.py',
+ '../_base_/default_runtime.py',
+]
+
+# model settings
+model = dict(
+ type='BlipVQA',
+ tokenizer=dict(type='BlipTokenizer', name_or_path='bert-base-uncased'),
+ vision_backbone=dict(
+ type='VisionTransformer',
+ arch='b',
+ img_size=480,
+ patch_size=16,
+ out_type='raw'),
+ multimodal_backbone=dict(
+ type='XBertEncoder',
+ med_config=dict(
+ architectures=['BertModel'],
+ attention_probs_dropout_prob=0.1,
+ hidden_act='gelu',
+ hidden_dropout_prob=0.1,
+ hidden_size=768,
+ initializer_range=0.02,
+ intermediate_size=3072,
+ layer_norm_eps=1e-12,
+ max_position_embeddings=512,
+ model_type='bert',
+ num_attention_heads=12,
+ num_hidden_layers=12,
+ pad_token_id=0,
+ add_type_embeddings=False,
+ vocab_size=30524,
+ encoder_width=768,
+ add_cross_attention=True),
+ ),
+ head=dict(
+ type='VQAGenerationHead',
+ decoder=dict(
+ type='XBertLMHeadDecoder',
+ med_config=dict(
+ architectures=['BertModel'],
+ attention_probs_dropout_prob=0.1,
+ hidden_act='gelu',
+ hidden_dropout_prob=0.1,
+ hidden_size=768,
+ initializer_range=0.02,
+ intermediate_size=3072,
+ layer_norm_eps=1e-12,
+ max_position_embeddings=512,
+ model_type='bert',
+ num_attention_heads=12,
+ num_hidden_layers=12,
+ pad_token_id=0,
+ add_type_embeddings=False,
+ vocab_size=30524,
+ encoder_width=768,
+ add_cross_attention=True),
+ ),
+ inference_method='generate',
+ ),
+)
+
+# schedule settings
+optimizer = dict(type='AdamW', lr=2e-5, weight_decay=0.05)
+optim_wrapper = dict(type='OptimWrapper', optimizer=optimizer)
+
+param_scheduler = [dict(type='CosineAnnealingLR', by_epoch=True)]
+
+train_cfg = dict(max_epochs=10, by_epoch=True)
+val_cfg = dict()
+test_cfg = dict()
+
+# runtime settings
+randomness = dict(seed=42)
diff --git a/configs/blip/blip-base_8xb32_okvqa.py b/configs/blip/blip-base_8xb32_okvqa.py
new file mode 100644
index 0000000000000000000000000000000000000000..548775c4e0f91128f41701042346b5d4a2567950
--- /dev/null
+++ b/configs/blip/blip-base_8xb32_okvqa.py
@@ -0,0 +1,75 @@
+_base_ = [
+ '../_base_/datasets/coco_okvqa.py',
+ '../_base_/default_runtime.py',
+]
+
+# model settings
+model = dict(
+ type='BlipVQA',
+ tokenizer=dict(type='BlipTokenizer', name_or_path='bert-base-uncased'),
+ vision_backbone=dict(
+ type='VisionTransformer',
+ arch='b',
+ img_size=480,
+ patch_size=16,
+ out_type='raw'),
+ multimodal_backbone=dict(
+ type='XBertEncoder',
+ med_config=dict(
+ architectures=['BertModel'],
+ attention_probs_dropout_prob=0.1,
+ hidden_act='gelu',
+ hidden_dropout_prob=0.1,
+ hidden_size=768,
+ initializer_range=0.02,
+ intermediate_size=3072,
+ layer_norm_eps=1e-12,
+ max_position_embeddings=512,
+ model_type='bert',
+ num_attention_heads=12,
+ num_hidden_layers=12,
+ pad_token_id=0,
+ add_type_embeddings=False,
+ vocab_size=30524,
+ encoder_width=768,
+ add_cross_attention=True),
+ ),
+ head=dict(
+ type='VQAGenerationHead',
+ decoder=dict(
+ type='XBertLMHeadDecoder',
+ med_config=dict(
+ architectures=['BertModel'],
+ attention_probs_dropout_prob=0.1,
+ hidden_act='gelu',
+ hidden_dropout_prob=0.1,
+ hidden_size=768,
+ initializer_range=0.02,
+ intermediate_size=3072,
+ layer_norm_eps=1e-12,
+ max_position_embeddings=512,
+ model_type='bert',
+ num_attention_heads=12,
+ num_hidden_layers=12,
+ pad_token_id=0,
+ add_type_embeddings=False,
+ vocab_size=30524,
+ encoder_width=768,
+ add_cross_attention=True),
+ ),
+ inference_method='generate',
+ ),
+)
+
+# schedule settings
+optimizer = dict(type='AdamW', lr=2e-5, weight_decay=0.05)
+optim_wrapper = dict(type='OptimWrapper', optimizer=optimizer)
+
+param_scheduler = [dict(type='CosineAnnealingLR', by_epoch=True)]
+
+train_cfg = dict(max_epochs=10, by_epoch=True)
+val_cfg = dict()
+test_cfg = dict()
+
+# runtime settings
+randomness = dict(seed=42)
diff --git a/configs/blip/blip-base_8xb32_retrieval.py b/configs/blip/blip-base_8xb32_retrieval.py
new file mode 100644
index 0000000000000000000000000000000000000000..645f88fd2a8e7ca06c75f603b7ad55539ef60053
--- /dev/null
+++ b/configs/blip/blip-base_8xb32_retrieval.py
@@ -0,0 +1,83 @@
+_base_ = [
+ '../_base_/datasets/coco_retrieval.py',
+ '../_base_/default_runtime.py',
+]
+
+# model settings
+model = dict(
+ type='BlipRetrieval',
+ tokenizer=dict(type='BlipTokenizer', name_or_path='bert-base-uncased'),
+ vision_backbone=dict(
+ type='VisionTransformer',
+ arch='b',
+ img_size=384,
+ patch_size=16,
+ out_type='raw',
+ ),
+ text_backbone=dict(
+ type='XBertEncoder',
+ med_config=dict(
+ architectures=['BertModel'],
+ attention_probs_dropout_prob=0.1,
+ hidden_act='gelu',
+ hidden_dropout_prob=0.1,
+ hidden_size=768,
+ initializer_range=0.02,
+ intermediate_size=3072,
+ layer_norm_eps=1e-12,
+ max_position_embeddings=512,
+ model_type='bert',
+ num_attention_heads=12,
+ num_hidden_layers=12,
+ pad_token_id=0,
+ add_type_embeddings=False,
+ vocab_size=30524,
+ encoder_width=768,
+ add_cross_attention=True),
+ ),
+ vision_neck=dict(
+ type='Linear',
+ in_features=768,
+ out_features=256,
+ ),
+ text_neck=dict(
+ type='Linear',
+ in_features=768,
+ out_features=256,
+ ),
+ head=dict(
+ type='ITCHead',
+ embed_dim=256,
+ ),
+ multimodal_head=dict(
+ type='ITMHead',
+ hidden_size=768,
+ with_pooler=False,
+ ),
+ topk=256,
+ max_txt_len=35,
+)
+
+# optimizer
+optimizer = dict(type='AdamW', lr=2e-5, weight_decay=0.04)
+optim_wrapper = dict(type='OptimWrapper', optimizer=optimizer)
+
+# learning rate scheduler
+param_scheduler = [dict(type='CosineAnnealingLR', by_epoch=True)]
+
+# runtime settings
+train_cfg = dict(type='EpochBasedTrainLoop', max_epochs=6)
+val_cfg = dict(type='RetrievalValLoop')
+test_cfg = dict(type='RetrievalTestLoop')
+
+randomness = dict(seed=42)
+
+default_hooks = dict(logger=dict(interval=1))
+
+custom_hooks = [
+ dict(
+ type='WarmupParamHook',
+ param_name='alpha',
+ module_name='head',
+ warmup_epochs=2)
+]
diff --git a/configs/blip/blip-base_8xb32_retrieval_flickr30k.py b/configs/blip/blip-base_8xb32_retrieval_flickr30k.py
new file mode 100644
index 0000000000000000000000000000000000000000..0d2e78e943161ec57539096aff5cbc7ae5f29186
--- /dev/null
+++ b/configs/blip/blip-base_8xb32_retrieval_flickr30k.py
@@ -0,0 +1,83 @@
+_base_ = [
+ '../_base_/datasets/flickr30k_retrieval.py',
+ '../_base_/default_runtime.py',
+]
+
+# model settings
+model = dict(
+ type='BlipRetrieval',
+ tokenizer=dict(type='BlipTokenizer', name_or_path='bert-base-uncased'),
+ vision_backbone=dict(
+ type='VisionTransformer',
+ arch='b',
+ img_size=384,
+ patch_size=16,
+ out_type='raw',
+ ),
+ text_backbone=dict(
+ type='XBertEncoder',
+ med_config=dict(
+ architectures=['BertModel'],
+ attention_probs_dropout_prob=0.1,
+ hidden_act='gelu',
+ hidden_dropout_prob=0.1,
+ hidden_size=768,
+ initializer_range=0.02,
+ intermediate_size=3072,
+ layer_norm_eps=1e-12,
+ max_position_embeddings=512,
+ model_type='bert',
+ num_attention_heads=12,
+ num_hidden_layers=12,
+ pad_token_id=0,
+ add_type_embeddings=False,
+ vocab_size=30524,
+ encoder_width=768,
+ add_cross_attention=True),
+ ),
+ vision_neck=dict(
+ type='Linear',
+ in_features=768,
+ out_features=256,
+ ),
+ text_neck=dict(
+ type='Linear',
+ in_features=768,
+ out_features=256,
+ ),
+ head=dict(
+ type='ITCHead',
+ embed_dim=256,
+ ),
+ multimodal_head=dict(
+ type='ITMHead',
+ hidden_size=768,
+ with_pooler=False,
+ ),
+ topk=256,
+ max_txt_len=35,
+)
+
+# optimizer
+optimizer = dict(type='AdamW', lr=2e-5, weight_decay=0.04)
+optim_wrapper = dict(type='OptimWrapper', optimizer=optimizer)
+
+# learning rate scheduler
+param_scheduler = [dict(type='CosineAnnealingLR', by_epoch=True)]
+
+# runtime settings
+train_cfg = dict(type='EpochBasedTrainLoop', max_epochs=6)
+val_cfg = dict(type='RetrievalValLoop')
+test_cfg = dict(type='RetrievalTestLoop')
+
+randomness = dict(seed=42)
+
+default_hooks = dict(logger=dict(interval=1))
+
+custom_hooks = [
+ dict(
+ type='WarmupParamHook',
+ param_name='alpha',
+ module_name='head',
+ warmup_epochs=2)
+]
diff --git a/configs/blip/blip-base_8xb32_vqa.py b/configs/blip/blip-base_8xb32_vqa.py
new file mode 100644
index 0000000000000000000000000000000000000000..2aa3f258579617d31b52b6e5a8e7703c56966dd4
--- /dev/null
+++ b/configs/blip/blip-base_8xb32_vqa.py
@@ -0,0 +1,76 @@
+_base_ = [
+ '../_base_/datasets/coco_vg_vqa.py',
+ '../_base_/default_runtime.py',
+]
+
+# model settings
+model = dict(
+ type='BlipVQA',
+ tokenizer=dict(type='BlipTokenizer', name_or_path='bert-base-uncased'),
+ vision_backbone=dict(
+ type='VisionTransformer',
+ arch='b',
+ img_size=480,
+ patch_size=16,
+ out_type='raw'),
+ multimodal_backbone=dict(
+ type='XBertEncoder',
+ med_config=dict(
+ architectures=['BertModel'],
+ attention_probs_dropout_prob=0.1,
+ hidden_act='gelu',
+ hidden_dropout_prob=0.1,
+ hidden_size=768,
+ initializer_range=0.02,
+ intermediate_size=3072,
+ layer_norm_eps=1e-12,
+ max_position_embeddings=512,
+ model_type='bert',
+ num_attention_heads=12,
+ num_hidden_layers=12,
+ pad_token_id=0,
+ add_type_embeddings=False,
+ vocab_size=30524,
+ encoder_width=768,
+ add_cross_attention=True),
+ ),
+ head=dict(
+ type='VQAGenerationHead',
+ decoder=dict(
+ type='XBertLMHeadDecoder',
+ med_config=dict(
+ architectures=['BertModel'],
+ attention_probs_dropout_prob=0.1,
+ hidden_act='gelu',
+ hidden_dropout_prob=0.1,
+ hidden_size=768,
+ initializer_range=0.02,
+ intermediate_size=3072,
+ layer_norm_eps=1e-12,
+ max_position_embeddings=512,
+ model_type='bert',
+ num_attention_heads=12,
+ num_hidden_layers=12,
+ pad_token_id=0,
+ add_type_embeddings=False,
+ vocab_size=30524,
+ encoder_width=768,
+ add_cross_attention=True),
+ ),
+ inference_method='rank', # or 'generate'
+ answer_list_path=
+ 'https://storage.googleapis.com/sfr-vision-language-research/datasets/answer_list.json', # noqa: E501
+ ),
+)
+
+# schedule settings
+optimizer = dict(type='AdamW', lr=2e-5, weight_decay=0.05)
+optim_wrapper = dict(type='OptimWrapper', optimizer=optimizer)
+
+param_scheduler = [dict(type='CosineAnnealingLR', by_epoch=True)]
+
+train_cfg = dict(max_epochs=10, by_epoch=True)
+test_cfg = dict()
+
+# runtime settings
+randomness = dict(seed=42)
diff --git a/configs/blip/metafile.yml b/configs/blip/metafile.yml
new file mode 100644
index 0000000000000000000000000000000000000000..8877e8192110df35415c875c834fc914bd3a038c
--- /dev/null
+++ b/configs/blip/metafile.yml
@@ -0,0 +1,99 @@
+Collections:
+ - Name: BLIP
+ Metadata:
+ Training Data:
+ - COCO
+ - VG
+ - Conceptual Captions
+ - Conceptual 12M
+ - SBU captions
+ Architecture:
+ - Transformer
+ Training Resources: 8x A100 GPUs
+ Paper:
+ Title: 'BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language
+ Understanding and Generation'
+ URL: https://arxiv.org/abs/2201.12086
+ README: configs/blip/README.md
+
+Models:
+ - Name: blip-base_8xb16_refcoco
+ Metadata:
+ FLOPs: null
+ Parameters: 498488636
+ In Collection: BLIP
+ Results:
+ - Task: Visual Grounding
+ Dataset: RefCOCO
+ Metrics:
+ Accuracy (testA): 86.14
+ Accuracy (testB): 77.33
+ Weights: https://download.openmmlab.com/mmclassification/v1/blip/blip-base_8xb16_refcoco_20230508-d2d10f4c.pth
+ Config: configs/blip/blip-base_8xb16_refcoco.py
+ - Name: blip-base_3rdparty_caption
+ Metadata:
+ FLOPs: null
+ Parameters: 223971644
+ In Collection: BLIP
+ Results:
+ - Dataset: COCO
+ Task: Image Caption
+ Metrics:
+ BLEU-4: 40.12
+ CIDER: 132.82
+ Weights: https://download.openmmlab.com/mmclassification/v1/blip/blip-base_3rdparty_coco-caption_20230419-a5b71af3.pth
+ Config: configs/blip/blip-base_8xb32_caption.py
+ Converted From:
+ Weights: https://storage.googleapis.com/sfr-vision-language-research/LAVIS/models/BLIP/blip_coco_caption_base.pth
+ Code: https://github.com/salesforce/LAVIS
+ - Name: blip-base_3rdparty_nlvr
+ Metadata:
+ FLOPs: null
+ Parameters: 259372034
+ In Collection: BLIP
+ Results:
+ - Task: NLVR
+ Dataset: NLVR2
+ Metrics:
+ Top 1 Accuracy: 82.33
+ Weights: https://download.openmmlab.com/mmclassification/v1/blip/blip-base_3rdparty_nlvr_20230427-3b14d33f.pth
+ Config: configs/blip/blip-base_8xb32_nlvr.py
+ Converted From:
+ Weights: https://storage.googleapis.com/sfr-vision-language-research/BLIP/models/model_base_nlvr.pth
+ Code: https://github.com/salesforce/LAVIS
+ - Name: blip-base_3rdparty_vqa
+ Metadata:
+ FLOPs: null
+ Parameters: 361478972
+ In Collection: BLIP
+ Results:
+ - Task: Visual Question Answering
+ Dataset: VQAv2
+ Metrics:
+ Accuracy: 78.2
+ Weights: https://download.openmmlab.com/mmclassification/v1/blip/blip-base_3rdparty-capflit_vqa_20230505-81488941.pth
+ Config: configs/blip/blip-base_8xb32_vqa.py
+ Converted From:
+ Weights: https://storage.googleapis.com/sfr-vision-language-research/BLIP/models/model_base_vqa_capfilt_large.pth
+ Code: https://github.com/salesforce/LAVIS
+ - Name: blip-base_3rdparty_retrieval
+ Metadata:
+ FLOPs: null
+ Parameters: 447486979
+ In Collection: BLIP
+ Results:
+ - Task: Image-To-Text Retrieval
+ Dataset: COCO
+ Metrics:
+ Recall@1: 82.52
+ Recall@5: 95.34
+ - Task: Text-To-Image Retrieval
+ Dataset: COCO
+ Metrics:
+ Recall@1: 64.82
+ Recall@5: 86.28
+ Weights: https://download.openmmlab.com/mmclassification/v1/blip/blip-base_3rdparty_coco-retrieval_20230419-a1804d2c.pth
+ Config: configs/blip/blip-base_8xb32_retrieval.py
+ Converted From:
+ Weights: https://storage.googleapis.com/sfr-vision-language-research/LAVIS/models/BLIP/blip_coco_retrieval.pth
+ Code: https://github.com/salesforce/LAVIS
diff --git a/configs/blip2/README.md b/configs/blip2/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..68ce679d704dfc23a0afdd7ec2528df9d144547e
--- /dev/null
+++ b/configs/blip2/README.md
@@ -0,0 +1,74 @@
+# BLIP-2
+
+> [BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models](http://arxiv.org/abs/2301.12597)
+
+
+
+## Abstract
+
+The cost of vision-and-language pre-training has become increasingly prohibitive due to end-to-end training of large-scale models. This paper proposes BLIP-2, a generic and efficient pretraining strategy that bootstraps vision-language pre-training from off-the-shelf frozen pre-trained image encoders and frozen large language models. BLIP-2 bridges the modality gap with a lightweight Querying Transformer, which is pretrained in two stages. The first stage bootstraps vision-language representation learning from a frozen image encoder. The second stage bootstraps vision-to-language generative learning from a frozen language model. BLIP-2 achieves state-of-the-art performance on various vision-language tasks, despite having significantly fewer trainable parameters than existing methods. For example, our model outperforms Flamingo80B by 8.7% on zero-shot VQAv2 with 54x fewer trainable parameters. We also demonstrate the model’s emerging capabilities of zero-shot image-to-text generation that can follow natural language instructions.
+
+
+

+
+
+## How to use it?
+
+
+
+**Use the model**
+
+```python
+from mmpretrain import inference_model
+
+result = inference_model('blip2-opt2.7b_3rdparty-zeroshot_caption', 'demo/cat-dog.png')
+print(result)
+# {'pred_caption': 'a dog and a cat sitting on a blanket'}
+```
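+
+The zero-shot VQA variant takes a question alongside the image. The call below is a sketch (the question is assumed to be accepted as the second positional argument, mirroring the caption usage above; check the VQA inferencer for the exact signature):
+
+```python
+from mmpretrain import inference_model
+
+# Hypothetical sketch for zero-shot VQA; the result is expected to contain a
+# `pred_answer` field.
+result = inference_model('blip2-opt2.7b_3rdparty-zeroshot_vqa', 'demo/cat-dog.png', 'What animals are in the picture?')
+print(result)
+```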
+
+**Test Command**
+
+Prepare your dataset according to the [docs](https://mmpretrain.readthedocs.io/en/latest/user_guides/dataset_prepare.html#prepare-dataset).
+
+Test:
+
+```shell
+python tools/test.py configs/blip2/blip2_8xb32_retrieval.py https://download.openmmlab.com/mmclassification/v1/blip2/blip2_3rdparty_pretrain_20230505-f7ef4390.pth
+```
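+
+Evaluating the zero-shot VQA entry in the table below follows the same pattern, assuming the VQAv2 data has been prepared:
+
+```shell
+python tools/test.py configs/blip2/blip2-opt2.7b_8xb16_vqa.py https://download.openmmlab.com/mmclassification/v1/blip2/blip2-opt2.7b_3rdparty_pretrain_20230505-b51db4e1.pth
+```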
+
+
+
+## Models and results
+
+### Image Caption on COCO
+
+| Model | Params (M) | BLEU-4 | CIDER | Config | Download |
+| :------------------------------------------ | :--------: | :----: | :----: | :----------------------------------------: | :-------------------------------------------------------------------------------------------: |
+| `blip2-opt2.7b_3rdparty-zeroshot_caption`\* | 3770.47 | 32.90 | 111.10 | [config](./blip2-opt2.7b_8xb32_caption.py) | [model](https://download.openmmlab.com/mmclassification/v1/blip2/blip2-opt2.7b_3rdparty_pretrain_20230505-b51db4e1.pth) |
+
+### Visual Question Answering on VQAv2
+
+| Model | Params (M) | Accuracy | Config | Download |
+| :-------------------------------------- | :--------: | :------: | :------------------------------------: | :-------------------------------------------------------------------------------------------------------: |
+| `blip2-opt2.7b_3rdparty-zeroshot_vqa`\* | 3770.47 | 53.50 | [config](./blip2-opt2.7b_8xb16_vqa.py) | [model](https://download.openmmlab.com/mmclassification/v1/blip2/blip2-opt2.7b_3rdparty_pretrain_20230505-b51db4e1.pth) |
+
+### Image-To-Text Retrieval on COCO
+
+| Model | Params (M) | Recall@1 | Config | Download |
+| :--------------------------- | :--------: | :------: | :----------------------------------: | :-------------------------------------------------------------------------------------------------------------: |
+| `blip2_3rdparty_retrieval`\* | 1173.19 | 85.40 | [config](./blip2_8xb32_retrieval.py) | [model](https://download.openmmlab.com/mmclassification/v1/blip2/blip2_3rdparty_pretrain_20230505-f7ef4390.pth) |
+
+*Models with * are converted from the [official repo](https://github.com/salesforce/LAVIS). The config files of these models are only for inference. We haven't reproduced the training results.*
+
+## Citation
+
+```bibtex
+@article{blip2,
+ title={Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models},
+ author={Li, Junnan and Li, Dongxu and Savarese, Silvio and Hoi, Steven},
+ year={2023},
+ eprint={2301.12597},
+ archivePrefix={arXiv},
+ primaryClass={cs.CV}
+}
+```
diff --git a/configs/blip2/blip2-opt2.7b_8xb16_gqa.py b/configs/blip2/blip2-opt2.7b_8xb16_gqa.py
new file mode 100644
index 0000000000000000000000000000000000000000..37fbd95e8e4b49d87f4da7b8d0f4cc7650f23dcd
--- /dev/null
+++ b/configs/blip2/blip2-opt2.7b_8xb16_gqa.py
@@ -0,0 +1,87 @@
+_base_ = [
+ '../_base_/datasets/gqa.py',
+ '../_base_/default_runtime.py',
+]
+
+# model settings
+model = dict(
+ type='Blip2VQA',
+ tokenizer=dict(
+ type='AutoTokenizer', name_or_path='facebook/opt-2.7b',
+ use_fast=False),
+ vision_backbone=dict(
+ type='BEiTViT',
+ # eva-g without the final layer
+ arch=dict(
+ embed_dims=1408,
+ num_layers=39,
+ num_heads=16,
+ feedforward_channels=6144,
+ ),
+ img_size=364,
+ patch_size=14,
+ out_indices=-2,
+ layer_scale_init_value=0.0,
+ use_abs_pos_emb=True,
+ use_rel_pos_bias=False,
+ frozen_stages=39,
+ final_norm=False,
+ use_shared_rel_pos_bias=False,
+ out_type='raw'),
+ text_backbone=dict(
+ type='OPTForCausalLM', name_or_path='facebook/opt-2.7b'),
+ multimodal_backbone=dict(
+ type='Qformer',
+ model_style='bert-base-uncased',
+ vision_model_width=1408,
+ add_cross_attention=True,
+ cross_attention_freq=2,
+ num_query_token=32),
+ vision_neck=dict(
+ type='LinearClsHead',
+ in_channels=768,
+ num_classes=2560,
+ ),
+ prompt='Question: {} Short Answer:',
+ max_txt_len=10)
+
+# data settings
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='RandomResizedCrop', scale=224),
+ dict(type='PackInputs', algorithm_keys=['question', 'gt_answer']),
+]
+
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='Resize',
+ scale=(224, 224),
+ interpolation='bicubic',
+ backend='pillow'),
+ dict(
+ type='CleanCaption',
+ keys=['question'],
+ ),
+ dict(type='PackInputs', algorithm_keys=['question', 'gt_answer']),
+]
+
+train_dataloader = dict(dataset=dict(pipeline=train_pipeline))
+val_dataloader = dict(dataset=dict(pipeline=test_pipeline))
+test_dataloader = val_dataloader
+
+# schedule settings
+optim_wrapper = dict(optimizer=dict(type='AdamW', lr=1e-5, weight_decay=0.05))
+
+param_scheduler = [
+ dict(
+ type='CosineAnnealingLR',
+ by_epoch=True,
+ begin=0,
+ end=10,
+ )
+]
+
+train_cfg = dict(max_epochs=10)
+val_cfg = dict()
+test_cfg = dict()
diff --git a/configs/blip2/blip2-opt2.7b_8xb16_vqa.py b/configs/blip2/blip2-opt2.7b_8xb16_vqa.py
new file mode 100644
index 0000000000000000000000000000000000000000..13a808dc224454642392142f9f6598f42e717b64
--- /dev/null
+++ b/configs/blip2/blip2-opt2.7b_8xb16_vqa.py
@@ -0,0 +1,95 @@
+_base_ = [
+ '../_base_/datasets/coco_vqa.py',
+ '../_base_/default_runtime.py',
+]
+
+# model settings
+model = dict(
+ type='Blip2VQA',
+ tokenizer=dict(
+ type='AutoTokenizer', name_or_path='facebook/opt-2.7b',
+ use_fast=False),
+ vision_backbone=dict(
+ type='BEiTViT',
+ # eva-g without the final layer
+ arch=dict(
+ embed_dims=1408,
+ num_layers=39,
+ num_heads=16,
+ feedforward_channels=6144,
+ ),
+ img_size=364,
+ patch_size=14,
+ out_indices=-2,
+ layer_scale_init_value=0.0,
+ use_abs_pos_emb=True,
+ use_rel_pos_bias=False,
+ frozen_stages=39,
+ final_norm=False,
+ use_shared_rel_pos_bias=False,
+ out_type='raw'),
+ text_backbone=dict(
+ type='OPTForCausalLM', name_or_path='facebook/opt-2.7b'),
+ multimodal_backbone=dict(
+ type='Qformer',
+ model_style='bert-base-uncased',
+ vision_model_width=1408,
+ add_cross_attention=True,
+ cross_attention_freq=2,
+ num_query_token=32),
+ vision_neck=dict(
+ type='LinearClsHead',
+ in_channels=768,
+ num_classes=2560,
+ ),
+ prompt='Question: {} Answer:',
+ max_txt_len=10)
+
+# data settings
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='RandomResizedCrop', scale=224),
+ dict(
+ type='PackInputs',
+ algorithm_keys=['question', 'gt_answer', 'gt_answer_weight'],
+ meta_keys=['question_id', 'image_id'],
+ ),
+]
+
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='Resize',
+ scale=(224, 224),
+ interpolation='bicubic',
+ backend='pillow'),
+ dict(
+ type='CleanCaption',
+ keys=['question'],
+ ),
+ dict(
+ type='PackInputs',
+ algorithm_keys=['question', 'gt_answer', 'gt_answer_weight'],
+ meta_keys=['question_id', 'image_id'],
+ ),
+]
+
+train_dataloader = dict(dataset=dict(pipeline=train_pipeline))
+val_dataloader = dict(dataset=dict(pipeline=test_pipeline))
+test_dataloader = val_dataloader
+
+# schedule settings
+optim_wrapper = dict(optimizer=dict(type='AdamW', lr=1e-5, weight_decay=0.05))
+
+param_scheduler = [
+ dict(
+ type='CosineAnnealingLR',
+ by_epoch=True,
+ begin=0,
+ end=10,
+ )
+]
+
+train_cfg = dict(max_epochs=10)
+val_cfg = dict()
+test_cfg = dict()
diff --git a/configs/blip2/blip2-opt2.7b_8xb32_caption.py b/configs/blip2/blip2-opt2.7b_8xb32_caption.py
new file mode 100644
index 0000000000000000000000000000000000000000..52d0a63223ffdaf69730dffc2a6d4212765255a6
--- /dev/null
+++ b/configs/blip2/blip2-opt2.7b_8xb32_caption.py
@@ -0,0 +1,76 @@
+_base_ = [
+ '../_base_/datasets/coco_caption.py',
+ '../_base_/default_runtime.py',
+]
+
+# model settings
+model = dict(
+ type='Blip2Caption',
+ tokenizer=dict(
+ type='AutoTokenizer', name_or_path='facebook/opt-2.7b',
+ use_fast=False),
+ vision_backbone=dict(
+ type='BEiTViT',
+ # eva-g without the final layer
+ arch=dict(
+ embed_dims=1408,
+ num_layers=39,
+ num_heads=16,
+ feedforward_channels=6144,
+ ),
+ img_size=364,
+ patch_size=14,
+ out_indices=-2,
+ layer_scale_init_value=0.0,
+ use_abs_pos_emb=True,
+ use_rel_pos_bias=False,
+ frozen_stages=39,
+ final_norm=False,
+ use_shared_rel_pos_bias=False,
+ out_type='raw'),
+ text_backbone=dict(
+ type='OPTForCausalLM', name_or_path='facebook/opt-2.7b'),
+ multimodal_backbone=dict(
+ type='Qformer',
+ model_style='bert-base-uncased',
+ vision_model_width=1408,
+ add_cross_attention=True,
+ cross_attention_freq=2,
+ num_query_token=32),
+ vision_neck=dict(
+ type='LinearClsHead',
+ in_channels=768,
+ num_classes=2560,
+ ),
+ prompt='a photo of',
+ max_txt_len=30)
+
+# schedule settings
+optim_wrapper = dict(optimizer=dict(type='AdamW', lr=1e-5, weight_decay=0.05))
+
+param_scheduler = [
+ dict(
+ type='CosineAnnealingLR',
+ by_epoch=True,
+ begin=0,
+ end=10,
+ )
+]
+
+train_cfg = dict(by_epoch=True, max_epochs=10)
+val_cfg = dict()
+test_cfg = dict()
+
+# dataset settings
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='Resize',
+ scale=(364, 364),
+ interpolation='bicubic',
+ backend='pillow'),
+ dict(type='PackInputs', meta_keys=['image_id']),
+]
+
+val_dataloader = dict(dataset=dict(pipeline=test_pipeline))
+test_dataloader = val_dataloader
diff --git a/configs/blip2/blip2_8xb32_retrieval.py b/configs/blip2/blip2_8xb32_retrieval.py
new file mode 100644
index 0000000000000000000000000000000000000000..75cb66cbfd53ac5e4e53928a65eb8617f00fb4af
--- /dev/null
+++ b/configs/blip2/blip2_8xb32_retrieval.py
@@ -0,0 +1,82 @@
+_base_ = [
+ '../_base_/datasets/coco_retrieval.py',
+ '../_base_/default_runtime.py',
+]
+
+# model settings
+model = dict(
+ type='Blip2Retrieval',
+ tokenizer=dict(type='Blip2Tokenizer', name_or_path='bert-base-uncased'),
+ vision_backbone=dict(
+ type='BEiTViT',
+ # eva-g without the final layer
+ arch=dict(
+ embed_dims=1408,
+ num_layers=39,
+ num_heads=16,
+ feedforward_channels=6144,
+ ),
+ img_size=364,
+ patch_size=14,
+ layer_scale_init_value=0.0,
+ use_abs_pos_emb=True,
+ use_rel_pos_bias=False,
+ final_norm=False,
+ use_shared_rel_pos_bias=False,
+ out_type='raw'),
+ multimodal_backbone=dict(
+ type='Qformer',
+ model_style='bert-base-uncased',
+ vision_model_width=1408,
+ add_cross_attention=True,
+ cross_attention_freq=2,
+ num_query_token=32),
+ vision_neck=dict(
+ type='LinearClsHead',
+ in_channels=768,
+ num_classes=256,
+ ),
+ text_neck=dict(
+ type='LinearClsHead',
+ in_channels=768,
+ num_classes=256,
+ ),
+ multimodal_head=dict(
+ type='ITMHead',
+ hidden_size=768,
+ with_pooler=False,
+ ),
+ topk=128,
+ max_txt_len=35,
+)
+
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='Resize',
+ scale=(364, 364),
+ interpolation='bicubic',
+ backend='pillow'),
+ dict(type='CleanCaption', keys='text'),
+ dict(
+ type='PackInputs',
+ algorithm_keys=['text', 'gt_text_id', 'gt_image_id'],
+ meta_keys=['image_id']),
+]
+
+val_dataloader = dict(dataset=dict(pipeline=test_pipeline))
+test_dataloader = val_dataloader
+
+# optimizer
+optimizer = dict(type='AdamW', lr=2e-5, weight_decay=0.04)
+optim_wrapper = dict(type='OptimWrapper', optimizer=optimizer)
+
+# learning rate scheduler
+param_scheduler = [dict(type='CosineAnnealingLR', by_epoch=True)]
+
+# runtime settings
+train_cfg = dict(type='EpochBasedTrainLoop', max_epochs=6)
+val_cfg = dict(type='RetrievalValLoop')
+test_cfg = dict(type='RetrievalTestLoop')
+
+randomness = dict(seed=42)
diff --git a/configs/blip2/metafile.yml b/configs/blip2/metafile.yml
new file mode 100644
index 0000000000000000000000000000000000000000..b822103a21fa0b1b350ffbc6c5fdd6fb8ad4e8e2
--- /dev/null
+++ b/configs/blip2/metafile.yml
@@ -0,0 +1,71 @@
+Collections:
+ - Name: BLIP-2
+ Metadata:
+ Training Data:
+ - COCO
+ - VG
+ - CC3M
+ - CC12M
+ - SBU
+ - LAION-400M
+ Training Resources: 8x A100 GPUs
+ Architecture:
+ - Transformer
+ - Q-Former
+ Paper:
+ Title: 'BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image
+ Encoders and Large Language Models'
+ URL: https://arxiv.org/abs/2301.12597
+ README: configs/blip2/README.md
+
+Models:
+ - Name: blip2_3rdparty_retrieval
+ Metadata:
+ FLOPs: null
+ Parameters: 1173191358
+ In Collection: BLIP-2
+ Results:
+ - Task: Image-To-Text Retrieval
+ Dataset: COCO
+ Metrics:
+ Recall@1: 85.4
+ - Task: Text-To-Image Retrieval
+ Dataset: COCO
+ Metrics:
+ Recall@1: 68.3
+ Weights: https://download.openmmlab.com/mmclassification/v1/blip2/blip2_3rdparty_pretrain_20230505-f7ef4390.pth
+ Config: configs/blip2/blip2_8xb32_retrieval.py
+ Converted From:
+ Weights: https://storage.googleapis.com/sfr-vision-language-research/LAVIS/models/BLIP2/blip2_pretrained_opt2.7b.pth
+ Code: https://github.com/salesforce/LAVIS
+ - Name: blip2-opt2.7b_3rdparty-zeroshot_vqa
+ Metadata:
+ FLOPs: null
+ Parameters: 3770465152
+ In Collection: BLIP-2
+ Results:
+ - Task: Visual Question Answering
+ Dataset: VQAv2
+ Metrics:
+ Accuracy: 53.5
+ Weights: https://download.openmmlab.com/mmclassification/v1/blip2/blip2-opt2.7b_3rdparty_pretrain_20230505-b51db4e1.pth
+ Config: configs/blip2/blip2-opt2.7b_8xb16_vqa.py
+ Converted From:
+ Weights: https://storage.googleapis.com/sfr-vision-language-research/LAVIS/models/BLIP2/blip2_pretrained_opt2.7b.pth
+ Code: https://github.com/salesforce/LAVIS
+ - Name: blip2-opt2.7b_3rdparty-zeroshot_caption
+ Metadata:
+ FLOPs: null
+ Parameters: 3770465152
+ In Collection: BLIP-2
+ Results:
+ - Task: Image Caption
+ Dataset: COCO
+ Metrics:
+ BLEU-4: 32.90
+ CIDER: 111.10
+ Weights: https://download.openmmlab.com/mmclassification/v1/blip2/blip2-opt2.7b_3rdparty_pretrain_20230505-b51db4e1.pth
+ Config: configs/blip2/blip2-opt2.7b_8xb32_caption.py
+ Converted From:
+ Weights: https://storage.googleapis.com/sfr-vision-language-research/LAVIS/models/BLIP2/blip2_pretrained_opt2.7b.pth
+ Code: https://github.com/salesforce/LAVIS
diff --git a/configs/byol/README.md b/configs/byol/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..2bfc8d064159ecfddaf2b2a4d0dca302b55e5f1f
--- /dev/null
+++ b/configs/byol/README.md
@@ -0,0 +1,85 @@
+# BYOL
+
+> [Bootstrap your own latent: A new approach to self-supervised Learning](https://arxiv.org/abs/2006.07733)
+
+
+
+## Abstract
+
+**B**ootstrap **Y**our **O**wn **L**atent (BYOL) is a new approach to self-supervised image representation learning. BYOL relies on two neural networks, referred to as online and target networks, that interact and learn from each other. From an augmented view of an image, we train the online network to predict the target network representation of the same image under a different augmented view. At the same time, we update the target network with a slow-moving average of the online network.
+
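+
+The interaction between the two networks boils down to a regression loss from the online prediction to the target projection, plus a slow-moving (EMA) update of the target weights. The sketch below is an illustrative simplification with made-up module names (`online_net`, `predictor`, `target_net`), not the `BYOL` implementation shipped in this repo:
+
+```python
+import copy
+
+import torch
+import torch.nn.functional as F
+
+
+def byol_step(online_net, predictor, target_net, view1, view2, momentum=0.99):
+    """One (asymmetric) BYOL step on two augmented views of the same images.
+
+    The real method symmetrizes the loss over both view orderings.
+    """
+    # Online branch predicts the target branch's projection of the other view.
+    pred = predictor(online_net(view1))
+    with torch.no_grad():
+        target = target_net(view2)
+    # Negative cosine similarity, equivalent to MSE between L2-normalized vectors.
+    loss = 2 - 2 * F.cosine_similarity(pred, target, dim=-1).mean()
+    # Slow-moving average update of the target network.
+    with torch.no_grad():
+        for t, o in zip(target_net.parameters(), online_net.parameters()):
+            t.mul_(momentum).add_(o, alpha=1 - momentum)
+    return loss
+
+
+# Toy usage: the target network starts as a copy of the online network.
+online = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 32 * 32, 128))
+predictor = torch.nn.Linear(128, 128)
+target = copy.deepcopy(online)
+loss = byol_step(online, predictor, target,
+                 torch.rand(4, 3, 32, 32), torch.rand(4, 3, 32, 32))
+```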
+
+
+
+
+## How to use it?
+
+
+
+**Predict image**
+
+```python
+from mmpretrain import inference_model
+
+predict = inference_model('resnet50_byol-pre_8xb512-linear-coslr-90e_in1k', 'demo/bird.JPEG')
+print(predict['pred_class'])
+print(predict['pred_score'])
+```
+
+**Use the model**
+
+```python
+import torch
+from mmpretrain import get_model
+
+model = get_model('byol_resnet50_16xb256-coslr-200e_in1k', pretrained=True)
+inputs = torch.rand(1, 3, 224, 224)
+out = model(inputs)
+print(type(out))
+# To extract features.
+feats = model.extract_feat(inputs)
+print(type(feats))
+```
+
+**Train/Test Command**
+
+Prepare your dataset according to the [docs](https://mmpretrain.readthedocs.io/en/latest/user_guides/dataset_prepare.html#prepare-dataset).
+
+Train:
+
+```shell
+python tools/train.py configs/byol/byol_resnet50_16xb256-coslr-200e_in1k.py
+```
+
+Test:
+
+```shell
+python tools/test.py configs/byol/benchmarks/resnet50_8xb512-linear-coslr-90e_in1k.py https://download.openmmlab.com/mmselfsup/1.x/byol/byol_resnet50_16xb256-coslr-200e_in1k/resnet50_linear-8xb512-coslr-90e_in1k/resnet50_linear-8xb512-coslr-90e_in1k_20220825-7596c6f5.pth
+```
+
+
+
+## Models and results
+
+### Pretrained models
+
+| Model | Params (M) | Flops (G) | Config | Download |
+| :-------------------------------------- | :--------: | :-------: | :------------------------------------------------: | :------------------------------------------------------------------------------------------: |
+| `byol_resnet50_16xb256-coslr-200e_in1k` | 68.02 | 4.11 | [config](byol_resnet50_16xb256-coslr-200e_in1k.py) | [model](https://download.openmmlab.com/mmselfsup/1.x/byol/byol_resnet50_16xb256-coslr-200e_in1k/byol_resnet50_16xb256-coslr-200e_in1k_20220825-de817331.pth) \| [log](https://download.openmmlab.com/mmselfsup/1.x/byol/byol_resnet50_16xb256-coslr-200e_in1k/byol_resnet50_16xb256-coslr-200e_in1k_20220825-de817331.json) |
+
+### Image Classification on ImageNet-1k
+
+| Model | Pretrain | Params (M) | Flops (G) | Top-1 (%) | Config | Download |
+| :---------------------------------------- | :------------------------------------------: | :--------: | :-------: | :-------: | :----------------------------------------: | :-------------------------------------------: |
+| `resnet50_byol-pre_8xb512-linear-coslr-90e_in1k` | [BYOL](https://download.openmmlab.com/mmselfsup/1.x/byol/byol_resnet50_16xb256-coslr-200e_in1k/byol_resnet50_16xb256-coslr-200e_in1k_20220825-de817331.pth) | 25.56 | 4.11 | 71.80 | [config](benchmarks/resnet50_8xb512-linear-coslr-90e_in1k.py) | [model](https://download.openmmlab.com/mmselfsup/1.x/byol/byol_resnet50_16xb256-coslr-200e_in1k/resnet50_linear-8xb512-coslr-90e_in1k/resnet50_linear-8xb512-coslr-90e_in1k_20220825-7596c6f5.pth) \| [log](https://download.openmmlab.com/mmselfsup/1.x/byol/byol_resnet50_16xb256-coslr-200e_in1k/resnet50_linear-8xb512-coslr-90e_in1k/resnet50_linear-8xb512-coslr-90e_in1k_20220825-7596c6f5.json) |
+
+## Citation
+
+```bibtex
+@inproceedings{grill2020bootstrap,
+ title={Bootstrap your own latent: A new approach to self-supervised learning},
+ author={Grill, Jean-Bastien and Strub, Florian and Altch{\'e}, Florent and Tallec, Corentin and Richemond, Pierre H and Buchatskaya, Elena and Doersch, Carl and Pires, Bernardo Avila and Guo, Zhaohan Daniel and Azar, Mohammad Gheshlaghi and others},
+ booktitle={NeurIPS},
+ year={2020}
+}
+```
diff --git a/configs/byol/benchmarks/mask-rcnn_r50-c4_ms-1x_coco.py b/configs/byol/benchmarks/mask-rcnn_r50-c4_ms-1x_coco.py
new file mode 100644
index 0000000000000000000000000000000000000000..4949db16a922737c5809b2c07519a6bb6867d165
--- /dev/null
+++ b/configs/byol/benchmarks/mask-rcnn_r50-c4_ms-1x_coco.py
@@ -0,0 +1,46 @@
+_base_ = 'mmdet::mask_rcnn/mask-rcnn_r50-caffe-c4_1x_coco.py'
+# https://github.com/open-mmlab/mmdetection/blob/dev-3.x/configs/mask_rcnn/mask-rcnn_r50-caffe-c4_1x_coco.py
+
+data_preprocessor = dict(
+ type='DetDataPreprocessor',
+ mean=[123.675, 116.28, 103.53],
+ std=[58.395, 57.12, 57.375],
+ bgr_to_rgb=True,
+ pad_mask=True,
+ pad_size_divisor=32)
+
+norm_cfg = dict(type='SyncBN', requires_grad=True)
+model = dict(
+ data_preprocessor=data_preprocessor,
+ backbone=dict(
+ frozen_stages=-1,
+ norm_cfg=norm_cfg,
+ norm_eval=False,
+ style='pytorch',
+ init_cfg=dict(type='Pretrained', checkpoint='torchvision://resnet50')),
+ roi_head=dict(
+ shared_head=dict(
+ type='ResLayerExtraNorm',
+ norm_cfg=norm_cfg,
+ norm_eval=False,
+ style='pytorch')))
+
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='LoadAnnotations', with_bbox=True, with_mask=True),
+ dict(
+ type='RandomChoiceResize',
+ scales=[(1333, 640), (1333, 672), (1333, 704), (1333, 736),
+ (1333, 768), (1333, 800)],
+ keep_ratio=True),
+ dict(type='RandomFlip', prob=0.5),
+ dict(type='PackDetInputs')
+]
+
+train_dataloader = dict(dataset=dict(pipeline=train_pipeline))
+
+train_cfg = dict(type='EpochBasedTrainLoop', max_epochs=12, val_interval=1)
+
+custom_imports = dict(
+ imports=['mmpretrain.models.utils.res_layer_extra_norm'],
+ allow_failed_imports=False)
diff --git a/configs/byol/benchmarks/mask-rcnn_r50_fpn_ms-1x_coco.py b/configs/byol/benchmarks/mask-rcnn_r50_fpn_ms-1x_coco.py
new file mode 100644
index 0000000000000000000000000000000000000000..1341f1508bdc400da6e79b47e1a174c0819fc79b
--- /dev/null
+++ b/configs/byol/benchmarks/mask-rcnn_r50_fpn_ms-1x_coco.py
@@ -0,0 +1,24 @@
+_base_ = 'mmdet::mask_rcnn/mask-rcnn_r50_fpn_1x_coco.py'
+# https://github.com/open-mmlab/mmdetection/blob/dev-3.x/configs/mask_rcnn/mask-rcnn_r50_fpn_1x_coco.py
+
+norm_cfg = dict(type='SyncBN', requires_grad=True)
+model = dict(
+ backbone=dict(frozen_stages=-1, norm_cfg=norm_cfg, norm_eval=False),
+ neck=dict(norm_cfg=norm_cfg),
+ roi_head=dict(
+ bbox_head=dict(type='Shared4Conv1FCBBoxHead', norm_cfg=norm_cfg),
+ mask_head=dict(norm_cfg=norm_cfg)))
+
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='LoadAnnotations', with_bbox=True, with_mask=True),
+ dict(
+ type='RandomChoiceResize',
+ scales=[(1333, 640), (1333, 672), (1333, 704), (1333, 736),
+ (1333, 768), (1333, 800)],
+ keep_ratio=True),
+ dict(type='RandomFlip', prob=0.5),
+ dict(type='PackDetInputs')
+]
+
+train_dataloader = dict(dataset=dict(pipeline=train_pipeline))
diff --git a/configs/byol/benchmarks/resnet50_8xb512-linear-coslr-90e_in1k.py b/configs/byol/benchmarks/resnet50_8xb512-linear-coslr-90e_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..2b5074c082b8b6fb36bd3c6711b60bab6394b4ce
--- /dev/null
+++ b/configs/byol/benchmarks/resnet50_8xb512-linear-coslr-90e_in1k.py
@@ -0,0 +1,18 @@
+_base_ = [
+ '../../_base_/models/resnet50.py',
+ '../../_base_/datasets/imagenet_bs32_pil_resize.py',
+ '../../_base_/schedules/imagenet_lars_coslr_90e.py',
+ '../../_base_/default_runtime.py',
+]
+
+model = dict(
+ backbone=dict(
+ frozen_stages=4,
+ init_cfg=dict(type='Pretrained', checkpoint='', prefix='backbone.')))
+
+# dataset summary
+train_dataloader = dict(batch_size=512)
+
+# runtime settings
+default_hooks = dict(
+ checkpoint=dict(type='CheckpointHook', interval=10, max_keep_ckpts=3))
diff --git a/configs/byol/byol_resnet50_16xb256-coslr-200e_in1k.py b/configs/byol/byol_resnet50_16xb256-coslr-200e_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..8dd3fd8bee88206f18d79500c401fa1f787d6e7f
--- /dev/null
+++ b/configs/byol/byol_resnet50_16xb256-coslr-200e_in1k.py
@@ -0,0 +1,60 @@
+_base_ = [
+ '../_base_/datasets/imagenet_bs32_byol.py',
+ '../_base_/schedules/imagenet_lars_coslr_200e.py',
+ '../_base_/default_runtime.py',
+]
+
+train_dataloader = dict(batch_size=256)
+
+# model settings
+model = dict(
+ type='BYOL',
+ base_momentum=0.01,
+ backbone=dict(
+ type='ResNet',
+ depth=50,
+ norm_cfg=dict(type='SyncBN'),
+ zero_init_residual=False),
+ neck=dict(
+ type='NonLinearNeck',
+ in_channels=2048,
+ hid_channels=4096,
+ out_channels=256,
+ num_layers=2,
+ with_bias=True,
+ with_last_bn=False,
+ with_avg_pool=True),
+ head=dict(
+ type='LatentPredictHead',
+ predictor=dict(
+ type='NonLinearNeck',
+ in_channels=256,
+ hid_channels=4096,
+ out_channels=256,
+ num_layers=2,
+ with_bias=True,
+ with_last_bn=False,
+ with_avg_pool=False),
+ loss=dict(type='CosineSimilarityLoss')),
+)
+
+# optimizer
+optimizer = dict(type='LARS', lr=4.8, momentum=0.9, weight_decay=1e-6)
+optim_wrapper = dict(
+ type='OptimWrapper',
+ optimizer=optimizer,
+ paramwise_cfg=dict(
+ custom_keys={
+ 'bn': dict(decay_mult=0, lars_exclude=True),
+ 'bias': dict(decay_mult=0, lars_exclude=True),
+ # bn layer in ResNet block downsample module
+ 'downsample.1': dict(decay_mult=0, lars_exclude=True),
+ }),
+)
+
+# runtime settings
+default_hooks = dict(checkpoint=dict(max_keep_ckpts=3))
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR
+# based on the actual training batch size.
+auto_scale_lr = dict(base_batch_size=4096)
diff --git a/configs/byol/metafile.yml b/configs/byol/metafile.yml
new file mode 100644
index 0000000000000000000000000000000000000000..09aacad1c580ec4ec4abe08e60dffd30eba540a8
--- /dev/null
+++ b/configs/byol/metafile.yml
@@ -0,0 +1,44 @@
+Collections:
+ - Name: BYOL
+ Metadata:
+ Training Data: ImageNet-1k
+ Training Techniques:
+ - LARS
+ Training Resources: 8x V100 GPUs (b256), 16x A100-80G GPUs (b4096)
+ Architecture:
+ - ResNet
+ - BYOL
+ Paper:
+ Title: 'Bootstrap your own latent: A new approach to self-supervised Learning'
+ URL: https://arxiv.org/abs/2006.07733
+ README: configs/byol/README.md
+
+Models:
+ - Name: byol_resnet50_16xb256-coslr-200e_in1k
+ Metadata:
+ Epochs: 200
+ Batch Size: 4096
+ FLOPs: 4109364224
+ Parameters: 68024448
+ Training Data: ImageNet-1k
+ In Collection: BYOL
+ Results: null
+ Weights: https://download.openmmlab.com/mmselfsup/1.x/byol/byol_resnet50_16xb256-coslr-200e_in1k/byol_resnet50_16xb256-coslr-200e_in1k_20220825-de817331.pth
+ Config: configs/byol/byol_resnet50_16xb256-coslr-200e_in1k.py
+ Downstream:
+ - resnet50_byol-pre_8xb512-linear-coslr-90e_in1k
+ - Name: resnet50_byol-pre_8xb512-linear-coslr-90e_in1k
+ Metadata:
+ Epochs: 90
+ Batch Size: 4096
+ FLOPs: 4109464576
+ Parameters: 25557032
+ Training Data: ImageNet-1k
+ In Collection: BYOL
+ Results:
+ - Task: Image Classification
+ Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 71.8
+ Weights: https://download.openmmlab.com/mmselfsup/1.x/byol/byol_resnet50_16xb256-coslr-200e_in1k/resnet50_linear-8xb512-coslr-90e_in1k/resnet50_linear-8xb512-coslr-90e_in1k_20220825-7596c6f5.pth
+ Config: configs/byol/benchmarks/resnet50_8xb512-linear-coslr-90e_in1k.py
diff --git a/configs/cae/README.md b/configs/cae/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..dc1c818d71c35300930c4a11b5e2ed52b995cd0e
--- /dev/null
+++ b/configs/cae/README.md
@@ -0,0 +1,86 @@
+# CAE
+
+> [Context Autoencoder for Self-Supervised Representation Learning](https://arxiv.org/abs/2202.03026)
+
+
+
+## Abstract
+
+We present a novel masked image modeling (MIM) approach, context autoencoder (CAE), for self-supervised learning. We randomly partition the image into two sets: visible patches and masked patches. The CAE architecture consists of: (i) an encoder that takes visible patches as input and outputs their latent representations, (ii) a latent context regressor that predicts the masked patch representations from the visible patch representations that are not updated in this regressor, (iii) a decoder that takes the estimated masked patch representations as input and makes predictions for the masked patches, and (iv) an alignment module that aligns the masked patch representation estimation with the masked patch representations computed from the encoder. In comparison to previous MIM methods that couple the encoding and decoding roles, e.g., using a single module in BEiT, our approach attempts to separate the encoding role (content understanding) from the decoding role (making predictions for masked patches) using different modules, improving the content understanding capability. In addition, our approach makes predictions from the visible patches to the masked patches in the latent representation space that is expected to take on semantics. In addition, we present the explanations about why contrastive pretraining and supervised pretraining perform similarly and why MIM potentially performs better. We demonstrate the effectiveness of our CAE through superior transfer performance in downstream tasks: semantic segmentation, and object detection and instance segmentation.
+
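+
+Concretely, the four modules above combine into a masked-patch prediction loss plus a latent alignment loss. The snippet below is only a schematic with hypothetical callables (`encoder`, `regressor`, `decoder`, `target_tokenizer`), not the `CAE` model implemented in this repo; `lambd=2.0` mirrors the loss weight used in the pre-training config.
+
+```python
+import torch
+import torch.nn.functional as F
+
+
+def cae_objective(encoder, regressor, decoder, target_tokenizer,
+                  visible_patches, masked_patches, lambd=2.0):
+    """Schematic CAE objective: masked-patch prediction + latent alignment."""
+    # (i) Encode only the visible patches.
+    z_visible = encoder(visible_patches)
+    # (ii) Regress latent representations of the masked patches from the visible ones.
+    z_masked_pred = regressor(z_visible)
+    # (iii) Decode the estimated latents into predictions for the masked patches.
+    logits = decoder(z_masked_pred)
+    with torch.no_grad():
+        # (iv) Alignment target: masked-patch latents computed by the same encoder.
+        z_masked = encoder(masked_patches)
+        # Prediction target, e.g. discrete visual tokens (a DALL-E tokenizer in the config).
+        tokens = target_tokenizer(masked_patches)
+    pred_loss = F.cross_entropy(logits, tokens)
+    align_loss = F.mse_loss(z_masked_pred, z_masked)
+    return pred_loss + lambd * align_loss
+```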
+
+
+
+
+## How to use it?
+
+
+
+**Predict image**
+
+```python
+from mmpretrain import inference_model
+
+predict = inference_model('beit-base-p16_cae-pre_8xb128-coslr-100e_in1k', 'demo/bird.JPEG')
+print(predict['pred_class'])
+print(predict['pred_score'])
+```
+
+**Use the model**
+
+```python
+import torch
+from mmpretrain import get_model
+
+model = get_model('cae_beit-base-p16_8xb256-amp-coslr-300e_in1k', pretrained=True)
+inputs = torch.rand(1, 3, 224, 224)
+out = model(inputs)
+print(type(out))
+# To extract features.
+feats = model.extract_feat(inputs)
+print(type(feats))
+```
+
+**Train/Test Command**
+
+Prepare your dataset according to the [docs](https://mmpretrain.readthedocs.io/en/latest/user_guides/dataset_prepare.html#prepare-dataset).
+
+Train:
+
+```shell
+python tools/train.py configs/cae/cae_beit-base-p16_8xb256-amp-coslr-300e_in1k.py
+```
+
+Test:
+
+```shell
+python tools/test.py configs/cae/benchmarks/beit-base-p16_8xb128-coslr-100e_in1k.py https://download.openmmlab.com/mmselfsup/1.x/cae/cae_vit-base-p16_16xb128-fp16-coslr-300e_in1k/vit-base-p16_ft-8xb128-coslr-100e-rpe_in1k/vit-base-p16_ft-8xb128-coslr-100e-rpe_in1k_20220825-f3d234cd.pth
+```
+
+
+
+## Models and results
+
+### Pretrained models
+
+| Model | Params (M) | Flops (G) | Config | Download |
+| :--------------------------------------------- | :--------: | :-------: | :-------------------------------------------------------: | :----------------------------------------------------------------------------: |
+| `cae_beit-base-p16_8xb256-amp-coslr-300e_in1k` | 288.43 | 17.58 | [config](cae_beit-base-p16_8xb256-amp-coslr-300e_in1k.py) | [model](https://download.openmmlab.com/mmselfsup/1.x/cae/cae_vit-base-p16_8xb256-amp-coslr-300e_in1k/cae_vit-base-p16_8xb256-amp-coslr-300e_in1k_20221230-808170f3.pth) \| [log](https://download.openmmlab.com/mmselfsup/1.x/cae/cae_vit-base-p16_8xb256-amp-coslr-300e_in1k/cae_vit-base-p16_8xb256-amp-coslr-300e_in1k_20221230-808170f3.json) |
+
+### Image Classification on ImageNet-1k
+
+| Model | Pretrain | Params (M) | Flops (G) | Top-1 (%) | Config | Download |
+| :---------------------------------------- | :------------------------------------------: | :--------: | :-------: | :-------: | :----------------------------------------: | :-------------------------------------------: |
+| `beit-base-p16_cae-pre_8xb128-coslr-100e_in1k` | [CAE](https://download.openmmlab.com/mmselfsup/1.x/cae/cae_vit-base-p16_8xb256-amp-coslr-300e_in1k/cae_vit-base-p16_8xb256-amp-coslr-300e_in1k_20221230-808170f3.pth) | 86.68 | 17.58 | 83.20 | [config](benchmarks/beit-base-p16_8xb128-coslr-100e_in1k.py) | [model](https://download.openmmlab.com/mmselfsup/1.x/cae/cae_vit-base-p16_16xb128-fp16-coslr-300e_in1k/vit-base-p16_ft-8xb128-coslr-100e-rpe_in1k/vit-base-p16_ft-8xb128-coslr-100e-rpe_in1k_20220825-f3d234cd.pth) \| [log](https://download.openmmlab.com/mmselfsup/1.x/cae/cae_vit-base-p16_16xb128-fp16-coslr-300e_in1k/vit-base-p16_ft-8xb128-coslr-100e-rpe_in1k/vit-base-p16_ft-8xb128-coslr-100e-rpe_in1k_20220825-f3d234cd.json) |
+
+## Citation
+
+```bibtex
+@article{CAE,
+ title={Context Autoencoder for Self-Supervised Representation Learning},
+ author={Chen, Xiaokang and Ding, Mingyu and Wang, Xiaodi and Xin, Ying and Mo, Shentong and
+ Wang, Yunhao and Han, Shumin and Luo, Ping and Zeng, Gang and Wang, Jingdong},
+ journal={arXiv preprint arXiv:2202.03026},
+ year={2022}
+}
+```
diff --git a/configs/cae/benchmarks/beit-base-p16_8xb128-coslr-100e_in1k.py b/configs/cae/benchmarks/beit-base-p16_8xb128-coslr-100e_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..e7083ce80a8311220fe6ebd5b6024c195887aa57
--- /dev/null
+++ b/configs/cae/benchmarks/beit-base-p16_8xb128-coslr-100e_in1k.py
@@ -0,0 +1,130 @@
+_base_ = [
+ '../../_base_/datasets/imagenet_bs64_swin_224.py',
+ '../../_base_/schedules/imagenet_bs1024_adamw_swin.py',
+ '../../_base_/default_runtime.py'
+]
+# CAE fine-tuning setting
+
+# dataset
+data_preprocessor = dict(
+ num_classes=1000,
+ # RGB format normalization parameters
+ mean=[127.5, 127.5, 127.5],
+ std=[127.5, 127.5, 127.5],
+ # convert image from BGR to RGB
+ to_rgb=True,
+)
+bgr_mean = data_preprocessor['mean'][::-1]
+bgr_std = data_preprocessor['std'][::-1]
+
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='RandomResizedCrop',
+ scale=224,
+ backend='pillow',
+ interpolation='bicubic'),
+ dict(type='RandomFlip', prob=0.5, direction='horizontal'),
+ dict(
+ type='RandAugment',
+ policies='timm_increasing',
+ num_policies=2,
+ total_level=10,
+ magnitude_level=9,
+ magnitude_std=0.5,
+ hparams=dict(
+ pad_val=[round(x) for x in bgr_mean], interpolation='bicubic')),
+ dict(
+ type='RandomErasing',
+ erase_prob=0.25,
+ mode='rand',
+ min_area_ratio=0.02,
+ max_area_ratio=1 / 3,
+ fill_color=bgr_mean,
+ fill_std=bgr_std),
+ dict(type='PackInputs'),
+]
+
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='ResizeEdge',
+ scale=256,
+ edge='short',
+ backend='pillow',
+ interpolation='bicubic'),
+ dict(type='CenterCrop', crop_size=224),
+ dict(type='PackInputs'),
+]
+train_dataloader = dict(dataset=dict(pipeline=train_pipeline), batch_size=128)
+val_dataloader = dict(dataset=dict(pipeline=test_pipeline), batch_size=128)
+
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='BEiTViT',
+ arch='base',
+ img_size=224,
+ patch_size=16,
+ final_norm=False, # do not use final norm
+ drop_path_rate=0.1,
+ layer_scale_init_value=0.1,
+ out_type='avg_featmap',
+ use_abs_pos_emb=True,
+ use_rel_pos_bias=True,
+ use_shared_rel_pos_bias=False,
+ init_cfg=dict(type='Pretrained', checkpoint='', prefix='backbone.')),
+ neck=None,
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=768,
+ loss=dict(
+ type='LabelSmoothLoss', label_smooth_val=0.1, mode='original'),
+ init_cfg=dict(type='TruncNormal', layer='Linear', std=2e-5)),
+ train_cfg=dict(augments=[
+ dict(type='Mixup', alpha=0.8),
+ dict(type='CutMix', alpha=1.0)
+ ]))
+
+# optimizer wrapper
+optim_wrapper = dict(
+ optimizer=dict(
+ type='AdamW', lr=8e-3, betas=(0.9, 0.999), weight_decay=0.05),
+ constructor='LearningRateDecayOptimWrapperConstructor',
+ paramwise_cfg=dict(
+ layer_decay_rate=0.65,
+ custom_keys={
+ '.ln': dict(decay_mult=0.0),
+ '.bias': dict(decay_mult=0.0),
+ '.cls_token': dict(decay_mult=0.0),
+ '.pos_embed': dict(decay_mult=0.0)
+ }))
+
+# learning rate scheduler
+param_scheduler = [
+ dict(
+ type='LinearLR',
+ start_factor=1e-4,
+ by_epoch=True,
+ begin=0,
+ end=5,
+ convert_to_iter_based=True),
+ dict(
+ type='CosineAnnealingLR',
+ T_max=95,
+ by_epoch=True,
+ begin=5,
+ end=100,
+ eta_min=1e-6,
+ convert_to_iter_based=True)
+]
+
+default_hooks = dict(
+ # save checkpoint per epoch.
+ checkpoint=dict(type='CheckpointHook', interval=1, max_keep_ckpts=3))
+
+train_cfg = dict(by_epoch=True, max_epochs=100)
+
+randomness = dict(seed=0)
diff --git a/configs/cae/cae_beit-base-p16_8xb256-amp-coslr-300e_in1k.py b/configs/cae/cae_beit-base-p16_8xb256-amp-coslr-300e_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..725b0f07ce71fa0ea98ae7343f0dbf47adda3ebb
--- /dev/null
+++ b/configs/cae/cae_beit-base-p16_8xb256-amp-coslr-300e_in1k.py
@@ -0,0 +1,115 @@
+_base_ = '../_base_/default_runtime.py'
+
+# dataset settings
+dataset_type = 'ImageNet'
+data_root = 'data/imagenet/'
+data_preprocessor = dict(
+ type='TwoNormDataPreprocessor',
+ mean=[123.675, 116.28, 103.53],
+ std=[58.395, 57.12, 57.375],
+ second_mean=[-31.875, -31.875, -31.875],
+ second_std=[318.75, 318.75, 318.75],
+ to_rgb=True)
+
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='RandomFlip', prob=0.5),
+ dict(
+ type='RandomResizedCropAndInterpolationWithTwoPic',
+ size=224,
+ second_size=112,
+ interpolation='bicubic',
+ second_interpolation='lanczos',
+ scale=(0.08, 1.0)),
+ dict(
+ type='BEiTMaskGenerator',
+ input_size=(14, 14),
+ num_masking_patches=75,
+ max_num_patches=None,
+ min_num_patches=16),
+ dict(type='PackInputs')
+]
+
+train_dataloader = dict(
+ batch_size=256,
+ num_workers=8,
+ persistent_workers=True,
+ sampler=dict(type='DefaultSampler', shuffle=True),
+ collate_fn=dict(type='default_collate'),
+ dataset=dict(
+ type=dataset_type,
+ data_root=data_root,
+ ann_file='meta/train.txt',
+ data_prefix=dict(img_path='train/'),
+ pipeline=train_pipeline))
+
+# model settings
+model = dict(
+ type='CAE',
+ backbone=dict(
+ type='CAEPretrainViT',
+ arch='b',
+ patch_size=16,
+ layer_scale_init_value=0.1,
+ bias='qv_bias'),
+ neck=dict(
+ type='CAENeck',
+ embed_dims=768,
+ num_heads=12,
+ regressor_depth=4,
+ decoder_depth=4,
+ mlp_ratio=4,
+ layer_scale_init_value=0.1,
+ ),
+ head=dict(type='CAEHead', loss=dict(type='CAELoss', lambd=2)),
+ target_generator=dict(
+ type='DALL-E',
+ init_cfg=dict(
+ type='Pretrained',
+ checkpoint= # noqa: E251
+ 'https://download.openmmlab.com/mmselfsup/1.x/target_generator_ckpt/dalle_encoder.pth', # noqa: E501
+ )),
+ base_momentum=0.0)
+
+# optimizer wrapper
+optim_wrapper = dict(
+ type='AmpOptimWrapper',
+ loss_scale='dynamic',
+ optimizer=dict(
+ type='AdamW', lr=1.5e-3, betas=(0.9, 0.999), weight_decay=0.05),
+ clip_grad=dict(max_norm=3.0),
+ paramwise_cfg=dict(
+ bias_decay_mult=0.0, norm_decay_mult=0.0, flat_decay_mult=0.0))
+
+# learning rate scheduler
+param_scheduler = [
+ dict(
+ type='LinearLR',
+ start_factor=1e-4,
+ by_epoch=True,
+ begin=0,
+ end=10,
+ convert_to_iter_based=True),
+ dict(
+ type='CosineAnnealingLR',
+ T_max=290,
+ eta_min=1e-5,
+ by_epoch=True,
+ begin=10,
+ end=300,
+ convert_to_iter_based=True)
+]
+
+# runtime settings
+train_cfg = dict(type='EpochBasedTrainLoop', max_epochs=300)
+default_hooks = dict(
+ # only keeps the latest 3 checkpoints
+ checkpoint=dict(type='CheckpointHook', interval=10, max_keep_ckpts=3))
+
+randomness = dict(seed=0, diff_rank_seed=True)
+
+find_unused_parameters = True
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR
+# based on the actual training batch size.
+auto_scale_lr = dict(base_batch_size=2048)
diff --git a/configs/cae/metafile.yml b/configs/cae/metafile.yml
new file mode 100644
index 0000000000000000000000000000000000000000..83f46f9f810384979a0f0b4483e9ab518653bcff
--- /dev/null
+++ b/configs/cae/metafile.yml
@@ -0,0 +1,43 @@
+Collections:
+ - Name: CAE
+ Metadata:
+ Training Data: ImageNet-1k
+ Training Techniques:
+ - AdamW
+ Training Resources: 8x A100-80G GPUs
+ Architecture:
+ - ViT
+ Paper:
+ Title: Context Autoencoder for Self-Supervised Representation Learning
+ URL: https://arxiv.org/abs/2202.03026
+ README: configs/cae/README.md
+
+Models:
+ - Name: cae_beit-base-p16_8xb256-amp-coslr-300e_in1k
+ Metadata:
+ Epochs: 300
+ Batch Size: 2048
+ FLOPs: 17581976064
+ Parameters: 288429952
+ Training Data: ImageNet-1k
+ In Collection: CAE
+ Results: null
+ Weights: https://download.openmmlab.com/mmselfsup/1.x/cae/cae_vit-base-p16_8xb256-amp-coslr-300e_in1k/cae_vit-base-p16_8xb256-amp-coslr-300e_in1k_20221230-808170f3.pth
+ Config: configs/cae/cae_beit-base-p16_8xb256-amp-coslr-300e_in1k.py
+ Downstream:
+ - beit-base-p16_cae-pre_8xb128-coslr-100e_in1k
+ - Name: beit-base-p16_cae-pre_8xb128-coslr-100e_in1k
+ Metadata:
+ Epochs: 100
+ Batch Size: 1024
+ FLOPs: 17581219584
+ Parameters: 86682280
+ Training Data: ImageNet-1k
+ In Collection: CAE
+ Results:
+ - Task: Image Classification
+ Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 83.2
+ Weights: https://download.openmmlab.com/mmselfsup/1.x/cae/cae_vit-base-p16_16xb128-fp16-coslr-300e_in1k/vit-base-p16_ft-8xb128-coslr-100e-rpe_in1k/vit-base-p16_ft-8xb128-coslr-100e-rpe_in1k_20220825-f3d234cd.pth
+ Config: configs/cae/benchmarks/beit-base-p16_8xb128-coslr-100e_in1k.py
diff --git a/configs/chinese_clip/README.md b/configs/chinese_clip/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..acb37e7a2adfdf641e07a695ec064cf8507f33ed
--- /dev/null
+++ b/configs/chinese_clip/README.md
@@ -0,0 +1,69 @@
+# ChineseCLIP
+
+> [Chinese CLIP: Contrastive Vision-Language Pretraining in Chinese](https://arxiv.org/abs/2211.01335)
+
+
+
+## Abstract
+
+The tremendous success of CLIP (Radford et al., 2021) has promoted the research and application of contrastive learning for vision-language pretraining. In this work, we construct a large-scale dataset of image-text pairs in Chinese, where most data are retrieved from publicly available datasets, and we pretrain Chinese CLIP models on the new dataset. We develop 5 Chinese CLIP models of multiple sizes, spanning from 77 to 958 million parameters. Furthermore, we propose a two-stage pretraining method, where the model is first trained with the image encoder frozen and then trained with all parameters being optimized, to achieve enhanced model performance. Our comprehensive experiments demonstrate that Chinese CLIP can achieve the state-of-the-art performance on MUGE, Flickr30K-CN, and COCO-CN in the setups of zero-shot learning and finetuning, and it is able to achieve competitive performance in zero-shot image classification based on the evaluation on the ELEVATER benchmark (Li et al., 2022). We have released our codes, models, and demos in https://github.com/OFA-Sys/Chinese-CLIP
+
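+
+The two-stage recipe described above reduces to toggling gradients on the image encoder between stages. A minimal PyTorch sketch, assuming a model object with `vision_backbone` and `text_backbone` submodules (hypothetical names, not the `ChineseCLIP` class configured below):
+
+```python
+import torch
+
+
+def set_pretraining_stage(model: torch.nn.Module, stage: int) -> None:
+    """Stage 1: keep the pretrained image encoder frozen and train the rest.
+    Stage 2: unfreeze everything and continue contrastive training."""
+    freeze_vision = stage == 1
+    for p in model.vision_backbone.parameters():
+        p.requires_grad = not freeze_vision
+    for p in model.text_backbone.parameters():
+        p.requires_grad = True
+
+
+# The optimizer should then only see trainable parameters, e.g.:
+# optimizer = torch.optim.AdamW(
+#     (p for p in model.parameters() if p.requires_grad), lr=1e-4)
+```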
+
+
+
+
+## How to use it?
+
+
+
+**Use the model for zero-shot classification**
+
+```python
+from mmpretrain import ImageClassificationInferencer
+
+inferencer = ImageClassificationInferencer(
+ 'cn-clip_resnet50_zeroshot-cls_cifar100',
+ pretrained=True,
+ classes=['鸟', '狗', '猫', '蛇'],
+ text_prototype=['鸟', '狗', '猫', '蛇'],
+)
+
+prediction = inferencer('./demo/bird.JPEG')[0]
+print('Results:', prediction['pred_class'])
+```
+
+**Train/Test Command**
+
+Prepare your dataset according to the [docs](https://mmpretrain.readthedocs.io/en/latest/user_guides/dataset_prepare.html#prepare-dataset).
+
+Test:
+
+```shell
+python tools/test.py configs/chinese_clip/cn-clip_resnet50_zeroshot-cls_cifar100.py https://download.openmmlab.com/mmpretrain/v1.0/chinese_clip/cn-clip_resnet50_3rdparty_20230519-6a2b3eb2.pth
+```
+
+
+
+## Models and results
+
+### Image Classification on CIFAR100
+
+| Model | Params (M) | Top-1 (%) | Config | Download |
+| :---------------------------------------------- | :--------: | :-------: | :------------------------------------------------------: | :----------------------------------------------------------------------------: |
+| `cn-clip_resnet50_zeroshot-cls_cifar100`\* | 77.00 | 40.70 | [config](cn-clip_resnet50_zeroshot-cls_cifar100.py) | [model](https://download.openmmlab.com/mmpretrain/v1.0/chinese_clip/cn-clip_resnet50_3rdparty_20230519-6a2b3eb2.pth) |
+| `cn-clip_vit-base-p16_zeroshot-cls_cifar100`\* | 188.00 | 64.50 | [config](cn-clip_vit-base-p16_zeroshot-cls_cifar100.py) | [model](https://download.openmmlab.com/mmpretrain/v1.0/chinese_clip/cn-clip_vit-base-p16_3rdparty_20230519-37fbc59e.pth) |
+| `cn-clip_vit-large-p14_zeroshot-cls_cifar100`\* | 406.00 | 74.80 | [config](cn-clip_vit-large-p14_zeroshot-cls_cifar100.py) | [model](https://download.openmmlab.com/mmpretrain/v1.0/chinese_clip/cn-clip_vit-large-p14_3rdparty_20230519-3f844503.pth) |
+| `cn-clip_vit-huge-p14_zeroshot-cls_cifar100`\* | 958.00 | 79.10 | [config](cn-clip_vit-huge-p14_zeroshot-cls_cifar100.py) | [model](https://download.openmmlab.com/mmpretrain/v1.0/chinese_clip/cn-clip_vit-huge-p14_3rdparty_20230519-e4f49b00.pth) |
+
+*Models with * are converted from the [official repo](https://github.com/OFA-Sys/Chinese-CLIP). The config files of these models are only for inference. We haven't reproduced the training results.*
+
+## Citation
+
+```bibtex
+@article{chinese-clip,
+ title={Chinese CLIP: Contrastive Vision-Language Pretraining in Chinese},
+ author={Yang, An and Pan, Junshu and Lin, Junyang and Men, Rui and Zhang, Yichang and Zhou, Jingren and Zhou, Chang},
+ journal={arXiv preprint arXiv:2211.01335},
+ year={2022}
+}
+```
diff --git a/configs/chinese_clip/cn-clip_resnet50_zeroshot-cls_cifar100.py b/configs/chinese_clip/cn-clip_resnet50_zeroshot-cls_cifar100.py
new file mode 100644
index 0000000000000000000000000000000000000000..e109a5bfbb4442580aa830259a2a29f4ba11a0b5
--- /dev/null
+++ b/configs/chinese_clip/cn-clip_resnet50_zeroshot-cls_cifar100.py
@@ -0,0 +1,72 @@
+_base_ = '../_base_/default_runtime.py'
+
+# data settings
+data_preprocessor = dict(
+ type='MultiModalDataPreprocessor',
+ mean=[0.48145466 * 255, 0.4578275 * 255, 0.40821073 * 255],
+ std=[0.26862954 * 255, 0.26130258 * 255, 0.27577711 * 255],
+ to_rgb=False,
+)
+
+test_pipeline = [
+ dict(type='Resize', scale=(224, 224), interpolation='bicubic'),
+ dict(
+ type='PackInputs',
+ meta_keys=['image_id', 'scale_factor'],
+ ),
+]
+
+train_dataloader = None
+test_dataloader = dict(
+ batch_size=32,
+ num_workers=8,
+ dataset=dict(
+ type='CIFAR100',
+ data_root='data/cifar100',
+ split='test',
+ pipeline=test_pipeline),
+ sampler=dict(type='DefaultSampler', shuffle=False),
+)
+test_evaluator = dict(type='Accuracy', topk=(1, ))
+
+# schedule settings
+train_cfg = None
+val_cfg = None
+test_cfg = dict()
+
+# model settings
+model = dict(
+ type='ChineseCLIP',
+ vision_backbone=dict(
+ type='ModifiedResNet',
+ depth=50,
+ base_channels=64,
+ input_size=224,
+ num_attn_heads=32,
+ output_dim=1024,
+ ),
+ text_backbone=dict(
+ type='BertModelCN',
+ config=dict(
+ vocab_size=21128,
+ pad_token_id=0,
+ add_type_embeddings=True,
+ attention_probs_dropout_prob=0.1,
+ hidden_act='gelu',
+ hidden_dropout_prob=0.1,
+ hidden_size=768,
+ initializer_range=0.02,
+ intermediate_size=3072,
+ max_position_embeddings=512,
+ num_attention_heads=12,
+ num_hidden_layers=3,
+ type_vocab_size=2,
+ layer_norm_eps=1e-12)),
+ tokenizer=dict(
+ type='FullTokenizer',
+ vocab_file= # noqa
+ 'https://download.openmmlab.com/mmpretrain/v1.0/chinese_clip/vocab.txt'
+ ),
+ proj_dim=1024,
+ text_prototype='cifar100',
+)
diff --git a/configs/chinese_clip/cn-clip_vit-base-p16_zeroshot-cls_cifar100.py b/configs/chinese_clip/cn-clip_vit-base-p16_zeroshot-cls_cifar100.py
new file mode 100644
index 0000000000000000000000000000000000000000..1c0ad1c9e39bcbfc615e688d5fc8c2812789989b
--- /dev/null
+++ b/configs/chinese_clip/cn-clip_vit-base-p16_zeroshot-cls_cifar100.py
@@ -0,0 +1,76 @@
+_base_ = '../_base_/default_runtime.py'
+
+# data settings
+data_preprocessor = dict(
+ type='MultiModalDataPreprocessor',
+ mean=[0.48145466 * 255, 0.4578275 * 255, 0.40821073 * 255],
+ std=[0.26862954 * 255, 0.26130258 * 255, 0.27577711 * 255],
+ to_rgb=False,
+)
+
+test_pipeline = [
+ dict(type='Resize', scale=(224, 224), interpolation='bicubic'),
+ dict(
+ type='PackInputs',
+ algorithm_keys=['text'],
+ meta_keys=['image_id', 'scale_factor'],
+ ),
+]
+
+train_dataloader = None
+test_dataloader = dict(
+ batch_size=32,
+ num_workers=8,
+ dataset=dict(
+ type='CIFAR100',
+ data_root='data/cifar100',
+ split='test',
+ pipeline=test_pipeline),
+ sampler=dict(type='DefaultSampler', shuffle=False),
+)
+test_evaluator = dict(type='Accuracy', topk=(1, ))
+
+# schedule settings
+train_cfg = None
+val_cfg = None
+test_cfg = dict()
+
+# model settings
+model = dict(
+ type='ChineseCLIP',
+ vision_backbone=dict(
+ type='VisionTransformer',
+ arch='base',
+ img_size=224,
+ patch_size=16,
+ norm_cfg=dict(type='LN', eps=1e-5),
+ final_norm=True,
+ layer_cfgs=dict(act_cfg=dict(type='QuickGELU')),
+ pre_norm=True,
+ out_type='cls_token',
+ ),
+ text_backbone=dict(
+ type='BertModelCN',
+ config=dict(
+ vocab_size=21128,
+ pad_token_id=0,
+ add_type_embeddings=True,
+ attention_probs_dropout_prob=0.1,
+ hidden_act='gelu',
+ hidden_dropout_prob=0.1,
+ hidden_size=768,
+ initializer_range=0.02,
+ intermediate_size=3072,
+ max_position_embeddings=512,
+ num_attention_heads=12,
+ num_hidden_layers=12,
+ type_vocab_size=2,
+ layer_norm_eps=1e-12)),
+ tokenizer=dict(
+ type='FullTokenizer',
+ vocab_file= # noqa
+ 'https://download.openmmlab.com/mmpretrain/v1.0/chinese_clip/vocab.txt'
+ ),
+ proj_dim=512,
+ text_prototype='cifar100',
+)
diff --git a/configs/chinese_clip/cn-clip_vit-huge-p14_zeroshot-cls_cifar100.py b/configs/chinese_clip/cn-clip_vit-huge-p14_zeroshot-cls_cifar100.py
new file mode 100644
index 0000000000000000000000000000000000000000..83aae122e8f0d2ec4fd78bb69e94feda09672980
--- /dev/null
+++ b/configs/chinese_clip/cn-clip_vit-huge-p14_zeroshot-cls_cifar100.py
@@ -0,0 +1,75 @@
+_base_ = '../_base_/default_runtime.py'
+
+# data settings
+data_preprocessor = dict(
+ type='MultiModalDataPreprocessor',
+ mean=[0.48145466 * 255, 0.4578275 * 255, 0.40821073 * 255],
+ std=[0.26862954 * 255, 0.26130258 * 255, 0.27577711 * 255],
+ to_rgb=False,
+)
+
+test_pipeline = [
+ dict(type='Resize', scale=(224, 224), interpolation='bicubic'),
+ dict(
+ type='PackInputs',
+ meta_keys=['image_id', 'scale_factor'],
+ ),
+]
+
+train_dataloader = None
+test_dataloader = dict(
+ batch_size=32,
+ num_workers=8,
+ dataset=dict(
+ type='CIFAR100',
+ data_root='data/cifar100',
+ split='test',
+ pipeline=test_pipeline),
+ sampler=dict(type='DefaultSampler', shuffle=False),
+)
+test_evaluator = dict(type='Accuracy', topk=(1, ))
+
+# schedule settings
+train_cfg = None
+val_cfg = None
+test_cfg = dict()
+
+# model settings
+model = dict(
+ type='ChineseCLIP',
+ vision_backbone=dict(
+ type='VisionTransformer',
+ arch='huge',
+ img_size=224,
+ patch_size=14,
+ norm_cfg=dict(type='LN', eps=1e-5),
+ final_norm=True,
+ layer_cfgs=dict(act_cfg=dict(type='QuickGELU')),
+ pre_norm=True,
+ out_type='cls_token',
+ ),
+ text_backbone=dict(
+ type='BertModelCN',
+ config=dict(
+ vocab_size=21128,
+ pad_token_id=0,
+ add_type_embeddings=True,
+ attention_probs_dropout_prob=0.1,
+ hidden_act='gelu',
+ hidden_dropout_prob=0.1,
+ hidden_size=1024,
+ initializer_range=0.02,
+ intermediate_size=4096,
+ max_position_embeddings=512,
+ num_attention_heads=16,
+ num_hidden_layers=24,
+ type_vocab_size=2,
+ layer_norm_eps=1e-12)),
+ tokenizer=dict(
+ type='FullTokenizer',
+ vocab_file= # noqa
+ 'https://download.openmmlab.com/mmpretrain/v1.0/chinese_clip/vocab.txt'
+ ),
+ proj_dim=1024,
+ text_prototype='cifar100',
+)
diff --git a/configs/chinese_clip/cn-clip_vit-large-p14_zeroshot-cls_cifar100.py b/configs/chinese_clip/cn-clip_vit-large-p14_zeroshot-cls_cifar100.py
new file mode 100644
index 0000000000000000000000000000000000000000..35f0b6fb53fa2bf8d389f4a0f6ea08bdbac72175
--- /dev/null
+++ b/configs/chinese_clip/cn-clip_vit-large-p14_zeroshot-cls_cifar100.py
@@ -0,0 +1,75 @@
+_base_ = '../_base_/default_runtime.py'
+
+# data settings
+data_preprocessor = dict(
+ type='MultiModalDataPreprocessor',
+ mean=[0.48145466 * 255, 0.4578275 * 255, 0.40821073 * 255],
+ std=[0.26862954 * 255, 0.26130258 * 255, 0.27577711 * 255],
+ to_rgb=False,
+)
+
+test_pipeline = [
+ dict(type='Resize', scale=(224, 224), interpolation='bicubic'),
+ dict(
+ type='PackInputs',
+ meta_keys=['image_id', 'scale_factor'],
+ ),
+]
+
+train_dataloader = None
+test_dataloader = dict(
+ batch_size=32,
+ num_workers=8,
+ dataset=dict(
+ type='CIFAR100',
+ data_root='data/cifar100',
+ split='test',
+ pipeline=test_pipeline),
+ sampler=dict(type='DefaultSampler', shuffle=False),
+)
+test_evaluator = dict(type='Accuracy', topk=(1, ))
+
+# schedule settings
+train_cfg = None
+val_cfg = None
+test_cfg = dict()
+
+# model settings
+model = dict(
+ type='ChineseCLIP',
+ vision_backbone=dict(
+ type='VisionTransformer',
+ arch='large',
+ img_size=224,
+ patch_size=14,
+ norm_cfg=dict(type='LN', eps=1e-5),
+ final_norm=True,
+ layer_cfgs=dict(act_cfg=dict(type='QuickGELU')),
+ pre_norm=True,
+ out_type='cls_token',
+ ),
+ text_backbone=dict(
+ type='BertModelCN',
+ config=dict(
+ vocab_size=21128,
+ pad_token_id=0,
+ add_type_embeddings=True,
+ attention_probs_dropout_prob=0.1,
+ hidden_act='gelu',
+ hidden_dropout_prob=0.1,
+ hidden_size=768,
+ initializer_range=0.02,
+ intermediate_size=3072,
+ max_position_embeddings=512,
+ num_attention_heads=12,
+ num_hidden_layers=12,
+ type_vocab_size=2,
+ layer_norm_eps=1e-12)),
+ tokenizer=dict(
+ type='FullTokenizer',
+ vocab_file= # noqa
+ 'https://download.openmmlab.com/mmpretrain/v1.0/chinese_clip/vocab.txt'
+ ),
+ proj_dim=768,
+ text_prototype='cifar100',
+)
diff --git a/configs/chinese_clip/metafile.yml b/configs/chinese_clip/metafile.yml
new file mode 100644
index 0000000000000000000000000000000000000000..40ebb49e001691c4b8adc87a9e1f24d352e41441
--- /dev/null
+++ b/configs/chinese_clip/metafile.yml
@@ -0,0 +1,79 @@
+Collections:
+ - Name: ChineseCLIP
+ Metadata:
+ Training Data:
+ - LAION-5B
+ - WuKong
+ - VisualGenome
+ - MSCOCO
+ Architecture:
+ - Transformer
+ Paper:
+ Title: 'Chinese CLIP: Contrastive Vision-Language Pretraining in Chinese'
+ URL: https://arxiv.org/abs/2211.01335
+ README: configs/chinese_clip/README.md
+
+Models:
+ - Name: cn-clip_resnet50_zeroshot-cls_cifar100
+ Metadata:
+ FLOPs: null
+ Parameters: 77000000
+ In Collection: ChineseCLIP
+ Results:
+ - Task: Image Classification
+ Dataset: CIFAR100
+ Metrics:
+ Top 1 Accuracy: 40.7
+ Weights: https://download.openmmlab.com/mmpretrain/v1.0/chinese_clip/cn-clip_resnet50_3rdparty_20230519-6a2b3eb2.pth
+ Config: configs/chinese_clip/cn-clip_resnet50_zeroshot-cls_cifar100.py
+ Converted From:
+ Weights: https://clip-cn-beijing.oss-cn-beijing.aliyuncs.com/checkpoints/clip_cn_rn50.pt
+ Code: https://github.com/OFA-Sys/Chinese-CLIP
+
+ - Name: cn-clip_vit-base-p16_zeroshot-cls_cifar100
+ Metadata:
+ FLOPs: null
+ Parameters: 188000000
+ In Collection: ChineseCLIP
+ Results:
+ - Task: Image Classification
+ Dataset: CIFAR100
+ Metrics:
+ Top 1 Accuracy: 64.5
+ Weights: https://download.openmmlab.com/mmpretrain/v1.0/chinese_clip/cn-clip_vit-base-p16_3rdparty_20230519-37fbc59e.pth
+ Config: configs/chinese_clip/cn-clip_vit-base-p16_zeroshot-cls_cifar100.py
+ Converted From:
+ Weights: https://clip-cn-beijing.oss-cn-beijing.aliyuncs.com/checkpoints/clip_cn_vit-b-16.pt
+ Code: https://github.com/OFA-Sys/Chinese-CLIP
+
+ - Name: cn-clip_vit-large-p14_zeroshot-cls_cifar100
+ Metadata:
+ FLOPs: null
+ Parameters: 406000000
+ In Collection: ChineseCLIP
+ Results:
+ - Task: Image Classification
+ Dataset: CIFAR100
+ Metrics:
+ Top 1 Accuracy: 74.8
+ Weights: https://download.openmmlab.com/mmpretrain/v1.0/chinese_clip/cn-clip_vit-large-p14_3rdparty_20230519-3f844503.pth
+ Config: configs/chinese_clip/cn-clip_vit-large-p14_zeroshot-cls_cifar100.py
+ Converted From:
+ Weights: https://clip-cn-beijing.oss-cn-beijing.aliyuncs.com/checkpoints/clip_cn_vit-l-14.pt
+ Code: https://github.com/OFA-Sys/Chinese-CLIP
+
+ - Name: cn-clip_vit-huge-p14_zeroshot-cls_cifar100
+ Metadata:
+ FLOPs: null
+ Parameters: 958000000
+ In Collection: ChineseCLIP
+ Results:
+ - Task: Image Classification
+ Dataset: CIFAR100
+ Metrics:
+ Top 1 Accuracy: 79.1
+ Weights: https://download.openmmlab.com/mmpretrain/v1.0/chinese_clip/cn-clip_vit-huge-p14_3rdparty_20230519-e4f49b00.pth
+ Config: configs/chinese_clip/cn-clip_vit-huge-p14_zeroshot-cls_cifar100.py
+ Converted From:
+ Weights: https://clip-cn-beijing.oss-cn-beijing.aliyuncs.com/checkpoints/clip_cn_vit-h-14.pt
+ Code: https://github.com/OFA-Sys/Chinese-CLIP
diff --git a/configs/clip/README.md b/configs/clip/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..7a14be4d8e05fe3ba1c9d51106889b63029964b9
--- /dev/null
+++ b/configs/clip/README.md
@@ -0,0 +1,90 @@
+# CLIP
+
+> [Learning Transferable Visual Models From Natural Language Supervision](https://arxiv.org/abs/2103.00020)
+
+
+
+## Abstract
+
+State-of-the-art computer vision systems are trained to predict a fixed set of predetermined object categories. This restricted form of supervision limits their generality and usability since additional labeled data is needed to specify any other visual concept. Learning directly from raw text about images is a promising alternative which leverages a much broader source of supervision. We demonstrate that the simple pre-training task of predicting which caption goes with which image is an efficient and scalable way to learn SOTA image representations from scratch on a dataset of 400 million (image, text) pairs collected from the internet. After pre-training, natural language is used to reference learned visual concepts (or describe new ones) enabling zero-shot transfer of the model to downstream tasks. We study the performance of this approach by benchmarking on over 30 different existing computer vision datasets, spanning tasks such as OCR, action recognition in videos, geo-localization, and many types of fine-grained object classification. The model transfers non-trivially to most tasks and is often competitive with a fully supervised baseline without the need for any dataset specific training. For instance, we match the accuracy of the original ResNet-50 on ImageNet zero-shot without needing to use any of the 1.28 million training examples it was trained on. We release our code and pre-trained model weights at this https URL.
+
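+
+The pre-training task of "predicting which caption goes with which image" is a symmetric contrastive (InfoNCE) loss over a batch of matched pairs. The sketch below is a generic illustration with made-up tensor names, not the training code behind the checkpoints on this page:
+
+```python
+import torch
+import torch.nn.functional as F
+
+
+def clip_contrastive_loss(image_embeds, text_embeds, temperature=0.07):
+    """Symmetric InfoNCE over N matched (image, text) pairs."""
+    image_embeds = F.normalize(image_embeds, dim=-1)
+    text_embeds = F.normalize(text_embeds, dim=-1)
+    logits = image_embeds @ text_embeds.t() / temperature  # (N, N) cosine similarities
+    targets = torch.arange(logits.size(0))  # the i-th image matches the i-th text
+    loss_i2t = F.cross_entropy(logits, targets)      # image -> text direction
+    loss_t2i = F.cross_entropy(logits.t(), targets)  # text -> image direction
+    return (loss_i2t + loss_t2i) / 2
+
+
+# Toy usage with 8 matched pairs of 512-d embeddings.
+loss = clip_contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))
+```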
+
+
+
+
+## How to use it?
+
+
+
+**Predict image**
+
+```python
+from mmpretrain import inference_model
+
+predict = inference_model('vit-base-p32_clip-laion2b-in12k-pre_3rdparty_in1k', 'demo/bird.JPEG')
+print(predict['pred_class'])
+print(predict['pred_score'])
+```
+
+**Use the model**
+
+```python
+import torch
+from mmpretrain import get_model
+
+model = get_model('vit-base-p32_clip-laion2b-in12k-pre_3rdparty_in1k', pretrained=True)
+inputs = torch.rand(1, 3, 224, 224)
+out = model(inputs)
+print(type(out))
+# To extract features.
+feats = model.extract_feat(inputs)
+print(type(feats))
+```
+
+**Test Command**
+
+Prepare your dataset according to the [docs](https://mmpretrain.readthedocs.io/en/latest/user_guides/dataset_prepare.html#prepare-dataset).
+
+Test:
+
+```shell
+python tools/test.py configs/clip/vit-base-p32_pt-64xb64_in1k.py https://download.openmmlab.com/mmclassification/v0/clip/clip-vit-base-p32_laion2b-in12k-pre_3rdparty_in1k_20221220-b384e830.pth
+```
+
+
+
+## Models and results
+
+### Image Classification on ImageNet-1k
+
+| Model | Pretrain | Params (M) | Flops (G) | Top-1 (%) | Top-5 (%) | Config | Download |
+| :------------------------------------------- | :-----------------------: | :--------: | :-------: | :-------: | :-------: | :--------------------------------------------: | :----------------------------------------------: |
+| `vit-base-p32_clip-laion2b-in12k-pre_3rdparty_in1k`\* | CLIP LAION2B ImageNet-12k | 88.22 | 4.36 | 83.06 | 96.49 | [config](vit-base-p32_pt-64xb64_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/clip/clip-vit-base-p32_laion2b-in12k-pre_3rdparty_in1k_20221220-b384e830.pth) |
+| `vit-base-p32_clip-laion2b-pre_3rdparty_in1k`\* | CLIP LAION2B | 88.22 | 4.36 | 82.46 | 96.12 | [config](vit-base-p32_pt-64xb64_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/clip/clip-vit-base-p32_laion2b-pre_3rdparty_in1k_20221220-194df57f.pth) |
+| `vit-base-p32_clip-openai-pre_3rdparty_in1k`\* | CLIP OPENAI | 88.22 | 4.36 | 81.77 | 95.89 | [config](vit-base-p32_pt-64xb64_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/clip/clip-vit-base-p32_openai-pre_3rdparty_in1k_20221220-a0182ba9.pth) |
+| `vit-base-p32_clip-laion2b-in12k-pre_3rdparty_in1k-384px`\* | CLIP LAION2B ImageNet-12k | 88.22 | 12.66 | 85.39 | 97.67 | [config](vit-base-p32_pt-64xb64_in1k-384px.py) | [model](https://download.openmmlab.com/mmclassification/v0/clip/clip-vit-base-p32_laion2b-in12k-pre_3rdparty_in1k-384px_20221220-c7757552.pth) |
+| `vit-base-p32_clip-openai-in12k-pre_3rdparty_in1k-384px`\* | CLIP OPENAI ImageNet-12k | 88.22 | 12.66 | 85.13 | 97.42 | [config](vit-base-p32_pt-64xb64_in1k-384px.py) | [model](https://download.openmmlab.com/mmclassification/v0/clip/clip-vit-base-p32_openai-in12k-pre_3rdparty_in1k-384px_20221220-dc2e49ea.pth) |
+| `vit-base-p16_clip-laion2b-in12k-pre_3rdparty_in1k`\* | CLIP LAION2B ImageNet-12k | 86.57 | 16.86 | 86.02 | 97.76 | [config](vit-base-p16_pt-64xb64_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/clip/clip-vit-base-p16_laion2b-in12k-pre_3rdparty_in1k_20221220-a5e31f8c.pth) |
+| `vit-base-p16_clip-laion2b-pre_3rdparty_in1k`\* | CLIP LAION2B | 86.57 | 16.86 | 85.49 | 97.59 | [config](vit-base-p16_pt-64xb64_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/clip/clip-vit-base-p16_laion2b-pre_3rdparty_in1k_20221220-5e24ff58.pth) |
+| `vit-base-p16_clip-openai-in12k-pre_3rdparty_in1k`\* | CLIP OPENAI ImageNet-12k | 86.57 | 16.86 | 85.99 | 97.72 | [config](vit-base-p16_pt-64xb64_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/clip/clip-vit-base-p16_openai-in12k-pre_3rdparty_in1k_20221220-90d930a8.pth) |
+| `vit-base-p16_clip-openai-pre_3rdparty_in1k`\* | CLIP OPENAI | 86.57 | 16.86 | 85.30 | 97.50 | [config](vit-base-p16_pt-64xb64_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/clip/clip-vit-base-p16_openai-pre_3rdparty_in1k_20221220-c7d9c899.pth) |
+| `vit-base-p32_clip-laion2b-in12k-pre_3rdparty_in1k-448px`\* | CLIP LAION2B ImageNet-12k | 88.22 | 17.20 | 85.76 | 97.63 | [config](vit-base-p32_pt-64xb64_in1k-448px.py) | [model](https://download.openmmlab.com/mmclassification/v0/clip/clip-vit-base-p32_laion2b-in12k-pre_3rdparty_in1k-448px_20221220-ca404a7d.pth) |
+| `vit-base-p16_clip-laion2b-in12k-pre_3rdparty_in1k-384px`\* | CLIP LAION2B ImageNet-12k | 86.57 | 49.37 | 87.17 | 98.02 | [config](vit-base-p16_pt-64xb64_in1k-384px.py) | [model](https://download.openmmlab.com/mmclassification/v0/clip/clip-vit-base-p16_laion2b-in12k-pre_3rdparty_in1k-384px_20221220-84ed0cc0.pth) |
+| `vit-base-p16_clip-laion2b-pre_3rdparty_in1k-384px`\* | CLIP LAION2B | 86.57 | 49.37 | 86.52 | 97.97 | [config](vit-base-p16_pt-64xb64_in1k-384px.py) | [model](https://download.openmmlab.com/mmclassification/v0/clip/clip-vit-base-p16_laion2b-pre_3rdparty_in1k-384px_20221220-558ed826.pth) |
+| `vit-base-p16_clip-openai-in12k-pre_3rdparty_in1k-384px`\* | CLIP OPENAI ImageNet-12k | 86.57 | 49.37 | 86.87 | 98.05 | [config](vit-base-p16_pt-64xb64_in1k-384px.py) | [model](https://download.openmmlab.com/mmclassification/v0/clip/clip-vit-base-p16_openai-in12k-pre_3rdparty_in1k-384px_20221220-8df86b74.pth) |
+| `vit-base-p16_clip-openai-pre_3rdparty_in1k-384px`\* | CLIP OPENAI | 86.57 | 49.37 | 86.25 | 97.90 | [config](vit-base-p16_pt-64xb64_in1k-384px.py) | [model](https://download.openmmlab.com/mmclassification/v0/clip/clip-vit-base-p16_openai-pre_3rdparty_in1k-384px_20221220-eb012e87.pth) |
+
+*Models with * are converted from [timm](https://github.com/rwightman/pytorch-image-models). The config files of these models are only for inference. We haven't reproduced the training results.*
+
+## Citation
+
+```bibtex
+@InProceedings{pmlr-v139-radford21a,
+  title = {Learning Transferable Visual Models From Natural Language Supervision},
+  author = {Radford, Alec and Kim, Jong Wook and Hallacy, Chris and Ramesh, Aditya and Goh, Gabriel and Agarwal, Sandhini and Sastry, Girish and Askell, Amanda and Mishkin, Pamela and Clark, Jack and Krueger, Gretchen and Sutskever, Ilya},
+  booktitle = {Proceedings of the 38th International Conference on Machine Learning},
+  year = {2021},
+  series = {Proceedings of Machine Learning Research},
+  publisher = {PMLR},
+}
+```
diff --git a/configs/clip/clip_vit-base-p16_zeroshot-cls_cifar100.py b/configs/clip/clip_vit-base-p16_zeroshot-cls_cifar100.py
new file mode 100644
index 0000000000000000000000000000000000000000..dd684a50a319e9e2b4942ce59ae6e20744b2743e
--- /dev/null
+++ b/configs/clip/clip_vit-base-p16_zeroshot-cls_cifar100.py
@@ -0,0 +1,68 @@
+_base_ = '../_base_/default_runtime.py'
+
+# data settings
+data_preprocessor = dict(
+ type='MultiModalDataPreprocessor',
+ mean=[0.48145466 * 255, 0.4578275 * 255, 0.40821073 * 255],
+ std=[0.26862954 * 255, 0.26130258 * 255, 0.27577711 * 255],
+ to_rgb=False,
+)
+
+test_pipeline = [
+ dict(type='Resize', scale=(224, 224), interpolation='bicubic'),
+ dict(
+ type='PackInputs',
+ algorithm_keys=['text'],
+ meta_keys=['image_id', 'scale_factor'],
+ ),
+]
+
+train_dataloader = None
+test_dataloader = dict(
+ batch_size=32,
+ num_workers=8,
+ dataset=dict(
+ type='CIFAR100',
+ data_root='data/cifar100',
+ split='test',
+ pipeline=test_pipeline),
+ sampler=dict(type='DefaultSampler', shuffle=False),
+)
+test_evaluator = dict(type='Accuracy', topk=(1, 5))
+
+# schedule settings
+train_cfg = None
+val_cfg = None
+test_cfg = dict()
+
+# model settings
+model = dict(
+ type='CLIPZeroShot',
+ vision_backbone=dict(
+ type='VisionTransformer',
+ arch='base',
+ img_size=224,
+ patch_size=16,
+ drop_rate=0.,
+ layer_cfgs=dict(act_cfg=dict(type='QuickGELU')),
+ pre_norm=True,
+ ),
+ projection=dict(type='CLIPProjection', in_channels=768, out_channels=512),
+ text_backbone=dict(
+ type='CLIPTransformer',
+ width=512,
+ layers=12,
+ heads=8,
+ attn_mask=True,
+ ),
+ tokenizer=dict(
+ type='AutoTokenizer',
+ name_or_path='openai/clip-vit-base-patch16',
+ use_fast=False),
+ vocab_size=49408,
+ transformer_width=512,
+ proj_dim=512,
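+    # Build the zero-shot text classifier from CIFAR-100 class names, using the
+    # OpenAI prompt templates for CIFAR-100.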
+ text_prototype='cifar100',
+ text_prompt='openai_cifar100',
+ context_length=77,
+)
diff --git a/configs/clip/clip_vit-base-p16_zeroshot-cls_in1k.py b/configs/clip/clip_vit-base-p16_zeroshot-cls_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..80c4fde82f514c96d9f171d6b3ed57fdbccd923a
--- /dev/null
+++ b/configs/clip/clip_vit-base-p16_zeroshot-cls_in1k.py
@@ -0,0 +1,69 @@
+_base_ = '../_base_/default_runtime.py'
+
+# data settings
+data_preprocessor = dict(
+ type='MultiModalDataPreprocessor',
+ mean=[0.48145466 * 255, 0.4578275 * 255, 0.40821073 * 255],
+ std=[0.26862954 * 255, 0.26130258 * 255, 0.27577711 * 255],
+ to_rgb=True,
+)
+
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='Resize', scale=(224, 224), interpolation='bicubic'),
+ dict(
+ type='PackInputs',
+ algorithm_keys=['text'],
+ meta_keys=['image_id', 'scale_factor'],
+ ),
+]
+
+train_dataloader = None
+test_dataloader = dict(
+ batch_size=32,
+ num_workers=8,
+ dataset=dict(
+ type='ImageNet',
+ data_root='data/imagenet',
+ split='val',
+ pipeline=test_pipeline),
+ sampler=dict(type='DefaultSampler', shuffle=False),
+)
+test_evaluator = dict(type='Accuracy', topk=(1, 5))
+
+# schedule settings
+train_cfg = None
+val_cfg = None
+test_cfg = dict()
+
+# model settings
+model = dict(
+ type='CLIPZeroShot',
+ vision_backbone=dict(
+ type='VisionTransformer',
+ arch='base',
+ img_size=224,
+ patch_size=16,
+ drop_rate=0.,
+ layer_cfgs=dict(act_cfg=dict(type='QuickGELU')),
+ pre_norm=True,
+ ),
+ projection=dict(type='CLIPProjection', in_channels=768, out_channels=512),
+ text_backbone=dict(
+ type='CLIPTransformer',
+ width=512,
+ layers=12,
+ heads=8,
+ attn_mask=True,
+ ),
+ tokenizer=dict(
+ type='AutoTokenizer',
+ name_or_path='openai/clip-vit-base-patch16',
+ use_fast=False),
+ vocab_size=49408,
+ transformer_width=512,
+ proj_dim=512,
+ text_prototype='imagenet',
+ text_prompt='openai_imagenet_sub', # openai_imagenet, openai_imagenet_sub
+ context_length=77,
+)
diff --git a/configs/clip/clip_vit-large-p14_zeroshot-cls_cifar100.py b/configs/clip/clip_vit-large-p14_zeroshot-cls_cifar100.py
new file mode 100644
index 0000000000000000000000000000000000000000..a6dd7c1141211914c9e9835b73d0ee84a46ea3b6
--- /dev/null
+++ b/configs/clip/clip_vit-large-p14_zeroshot-cls_cifar100.py
@@ -0,0 +1,68 @@
+_base_ = '../_base_/default_runtime.py'
+
+# data settings
+data_preprocessor = dict(
+ type='MultiModalDataPreprocessor',
+ mean=[0.48145466 * 255, 0.4578275 * 255, 0.40821073 * 255],
+ std=[0.26862954 * 255, 0.26130258 * 255, 0.27577711 * 255],
+ to_rgb=False,
+)
+
+test_pipeline = [
+ dict(type='Resize', scale=(224, 224), interpolation='bicubic'),
+ dict(
+ type='PackInputs',
+ algorithm_keys=['text'],
+ meta_keys=['image_id', 'scale_factor'],
+ ),
+]
+
+train_dataloader = None
+test_dataloader = dict(
+ batch_size=32,
+ num_workers=8,
+ dataset=dict(
+ type='CIFAR100',
+ data_root='data/cifar100',
+ split='test',
+ pipeline=test_pipeline),
+ sampler=dict(type='DefaultSampler', shuffle=False),
+)
+test_evaluator = dict(type='Accuracy', topk=(1, 5))
+
+# schedule settings
+train_cfg = None
+val_cfg = None
+test_cfg = dict()
+
+# model settings
+model = dict(
+ type='CLIPZeroShot',
+ vision_backbone=dict(
+ type='VisionTransformer',
+ arch='large',
+ img_size=224,
+ patch_size=14,
+ drop_rate=0.,
+ layer_cfgs=dict(act_cfg=dict(type='QuickGELU')),
+ pre_norm=True,
+ ),
+ projection=dict(type='CLIPProjection', in_channels=1024, out_channels=768),
+ text_backbone=dict(
+ type='CLIPTransformer',
+ width=768,
+ layers=12,
+ heads=12,
+ attn_mask=True,
+ ),
+ tokenizer=dict(
+ type='AutoTokenizer',
+ name_or_path='openai/clip-vit-large-patch14',
+ use_fast=False),
+ vocab_size=49408,
+ transformer_width=768,
+ proj_dim=768,
+ text_prototype='cifar100',
+ text_prompt='openai_cifar100',
+ context_length=77,
+)
diff --git a/configs/clip/clip_vit-large-p14_zeroshot-cls_in1k.py b/configs/clip/clip_vit-large-p14_zeroshot-cls_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..10500017a9300e7c2cf8082e575378f346888c3d
--- /dev/null
+++ b/configs/clip/clip_vit-large-p14_zeroshot-cls_in1k.py
@@ -0,0 +1,69 @@
+_base_ = '../_base_/default_runtime.py'
+
+# data settings
+data_preprocessor = dict(
+ type='MultiModalDataPreprocessor',
+ mean=[0.48145466 * 255, 0.4578275 * 255, 0.40821073 * 255],
+ std=[0.26862954 * 255, 0.26130258 * 255, 0.27577711 * 255],
+ to_rgb=True,
+)
+
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='Resize', scale=(224, 224), interpolation='bicubic'),
+ dict(
+ type='PackInputs',
+ algorithm_keys=['text'],
+ meta_keys=['image_id', 'scale_factor'],
+ ),
+]
+
+train_dataloader = None
+test_dataloader = dict(
+ batch_size=32,
+ num_workers=8,
+ dataset=dict(
+ type='ImageNet',
+ data_root='data/imagenet',
+ split='val',
+ pipeline=test_pipeline),
+ sampler=dict(type='DefaultSampler', shuffle=False),
+)
+test_evaluator = dict(type='Accuracy', topk=(1, 5))
+
+# schedule settings
+train_cfg = None
+val_cfg = None
+test_cfg = dict()
+
+# model settings
+model = dict(
+ type='CLIPZeroShot',
+ vision_backbone=dict(
+ type='VisionTransformer',
+ arch='large',
+ img_size=224,
+ patch_size=14,
+ drop_rate=0.,
+ layer_cfgs=dict(act_cfg=dict(type='QuickGELU')),
+ pre_norm=True,
+ ),
+ projection=dict(type='CLIPProjection', in_channels=1024, out_channels=768),
+ text_backbone=dict(
+ type='CLIPTransformer',
+ width=768,
+ layers=12,
+ heads=12,
+ attn_mask=True,
+ ),
+ tokenizer=dict(
+ type='AutoTokenizer',
+ name_or_path='openai/clip-vit-large-patch14',
+ use_fast=False),
+ vocab_size=49408,
+ transformer_width=768,
+ proj_dim=768,
+ text_prototype='imagenet',
+ text_prompt='openai_imagenet_sub', # openai_imagenet, openai_imagenet_sub
+ context_length=77,
+)
diff --git a/configs/clip/metafile.yml b/configs/clip/metafile.yml
new file mode 100644
index 0000000000000000000000000000000000000000..a82eea49aa0815cf94ac9324ffaea445f815a473
--- /dev/null
+++ b/configs/clip/metafile.yml
@@ -0,0 +1,308 @@
+Collections:
+ - Name: CLIP
+ Metadata:
+ Architecture:
+ - Attention Dropout
+ - Convolution
+ - Dense Connections
+ - Dropout
+ - GELU
+ - Layer Normalization
+ - Multi-Head Attention
+ - Scaled Dot-Product Attention
+ - Tanh Activation
+ Paper:
+ Title: Learning Transferable Visual Models From Natural Language Supervision
+ URL: https://arxiv.org/abs/2103.00020
+ README: configs/clip/README.md
+ Code:
+ URL: https://github.com/open-mmlab/mmpretrain/blob/main/mmpretrain/models/backbones/vision_transformer.py
+ Version: v1.0.0
+
+Models:
+ - Name: vit-base-p32_clip-openai-pre_3rdparty_in1k
+ Metadata:
+ FLOPs: 4364335104
+ Parameters: 88225000
+ Training Data:
+ - OpenAI
+ - ImageNet-1k
+ In Collection: CLIP
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 81.77
+ Top 5 Accuracy: 95.89
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/clip/clip-vit-base-p32_openai-pre_3rdparty_in1k_20221220-a0182ba9.pth
+ Config: configs/clip/vit-base-p32_pt-64xb64_in1k.py
+ Converted From:
+ Code: https://github.com/rwightman/pytorch-image-models
+ Weights: https://huggingface.co/timm/vit_base_patch32_clip_224.openai_ft_in1k
+ - Name: vit-base-p32_clip-laion2b-pre_3rdparty_in1k
+ Metadata:
+ FLOPs: 4364335104
+ Parameters: 88225000
+ Training Data:
+ - LAION-2B
+ - ImageNet-1k
+ In Collection: CLIP
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 82.46
+ Top 5 Accuracy: 96.12
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/clip/clip-vit-base-p32_laion2b-pre_3rdparty_in1k_20221220-194df57f.pth
+ Config: configs/clip/vit-base-p32_pt-64xb64_in1k.py
+ Converted From:
+ Code: https://github.com/rwightman/pytorch-image-models
+ Weights: https://huggingface.co/timm/vit_base_patch32_clip_224.laion2b_ft_in1k
+ - Name: vit-base-p32_clip-laion2b-in12k-pre_3rdparty_in1k
+ Metadata:
+ FLOPs: 4364335104
+ Parameters: 88225000
+ Training Data:
+ - LAION-2B
+ - ImageNet-12k
+ - ImageNet-1k
+ In Collection: CLIP
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 83.06
+ Top 5 Accuracy: 96.49
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/clip/clip-vit-base-p32_laion2b-in12k-pre_3rdparty_in1k_20221220-b384e830.pth
+ Config: configs/clip/vit-base-p32_pt-64xb64_in1k.py
+ Converted From:
+ Code: https://github.com/rwightman/pytorch-image-models
+ Weights: https://huggingface.co/timm/vit_base_patch32_clip_224.laion2b_ft_in12k_in1k
+ - Name: vit-base-p32_clip-openai-in12k-pre_3rdparty_in1k-384px
+ Metadata:
+ FLOPs: 12661054464
+ Parameters: 88225000
+ Training Data:
+ - OpenAI
+ - ImageNet-12k
+ - ImageNet-1k
+ In Collection: CLIP
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 85.13
+ Top 5 Accuracy: 97.42
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/clip/clip-vit-base-p32_openai-in12k-pre_3rdparty_in1k-384px_20221220-dc2e49ea.pth
+ Config: configs/clip/vit-base-p32_pt-64xb64_in1k-384px.py
+ Converted From:
+ Code: https://github.com/rwightman/pytorch-image-models
+ Weights: https://huggingface.co/timm/vit_base_patch32_clip_384.openai_ft_in12k_in1k
+ - Name: vit-base-p32_clip-laion2b-in12k-pre_3rdparty_in1k-384px
+ Metadata:
+ FLOPs: 12661054464
+ Parameters: 88225000
+ Training Data:
+ - LAION-2B
+ - ImageNet-12k
+ - ImageNet-1k
+ In Collection: CLIP
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 85.39
+ Top 5 Accuracy: 97.67
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/clip/clip-vit-base-p32_laion2b-in12k-pre_3rdparty_in1k-384px_20221220-c7757552.pth
+ Config: configs/clip/vit-base-p32_pt-64xb64_in1k-384px.py
+ Converted From:
+ Code: https://github.com/rwightman/pytorch-image-models
+ Weights: https://huggingface.co/timm/vit_base_patch32_clip_384.laion2b_ft_in12k_in1k
+ - Name: vit-base-p16_clip-openai-pre_3rdparty_in1k
+ Metadata:
+ FLOPs: 16855600128
+ Parameters: 86568424
+ Training Data:
+ - OpenAI
+ - ImageNet-1k
+ In Collection: CLIP
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 85.3
+ Top 5 Accuracy: 97.5
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/clip/clip-vit-base-p16_openai-pre_3rdparty_in1k_20221220-c7d9c899.pth
+ Config: configs/clip/vit-base-p16_pt-64xb64_in1k.py
+ Converted From:
+ Code: https://github.com/rwightman/pytorch-image-models
+ Weights: https://huggingface.co/timm/vit_base_patch16_clip_224.openai_ft_in1k
+ - Name: vit-base-p16_clip-laion2b-pre_3rdparty_in1k
+ Metadata:
+ FLOPs: 16855600128
+ Parameters: 86568424
+ Training Data:
+ - LAION-2B
+ - ImageNet-1k
+ In Collection: CLIP
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 85.49
+ Top 5 Accuracy: 97.59
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/clip/clip-vit-base-p16_laion2b-pre_3rdparty_in1k_20221220-5e24ff58.pth
+ Config: configs/clip/vit-base-p16_pt-64xb64_in1k.py
+ Converted From:
+ Code: https://github.com/rwightman/pytorch-image-models
+ Weights: https://huggingface.co/timm/vit_base_patch16_clip_224.laion2b_ft_in1k
+ - Name: vit-base-p16_clip-openai-in12k-pre_3rdparty_in1k
+ Metadata:
+ FLOPs: 16855600128
+ Parameters: 86568424
+ Training Data:
+ - OpenAI
+ - ImageNet-12k
+ - ImageNet-1k
+ In Collection: CLIP
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 85.99
+ Top 5 Accuracy: 97.72
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/clip/clip-vit-base-p16_openai-in12k-pre_3rdparty_in1k_20221220-90d930a8.pth
+ Config: configs/clip/vit-base-p16_pt-64xb64_in1k.py
+ Converted From:
+ Code: https://github.com/rwightman/pytorch-image-models
+ Weights: https://huggingface.co/timm/vit_base_patch16_clip_224.openai_ft_in12k_in1k
+ - Name: vit-base-p16_clip-laion2b-in12k-pre_3rdparty_in1k
+ Metadata:
+ FLOPs: 16855600128
+ Parameters: 86568424
+ Training Data:
+ - LAION-2B
+ - ImageNet-12k
+ - ImageNet-1k
+ In Collection: CLIP
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 86.02
+ Top 5 Accuracy: 97.76
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/clip/clip-vit-base-p16_laion2b-in12k-pre_3rdparty_in1k_20221220-a5e31f8c.pth
+ Config: configs/clip/vit-base-p16_pt-64xb64_in1k.py
+ Converted From:
+ Code: https://github.com/rwightman/pytorch-image-models
+ Weights: https://huggingface.co/timm/vit_base_patch16_clip_224.laion2b_ft_in12k_in1k
+ - Name: vit-base-p32_clip-laion2b-in12k-pre_3rdparty_in1k-448px
+ Metadata:
+ FLOPs: 17202416640
+ Parameters: 88225000
+ Training Data:
+ - LAION-2B
+ - ImageNet-12k
+ - ImageNet-1k
+ In Collection: CLIP
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 85.76
+ Top 5 Accuracy: 97.63
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/clip/clip-vit-base-p32_laion2b-in12k-pre_3rdparty_in1k-448px_20221220-ca404a7d.pth
+ Config: configs/clip/vit-base-p32_pt-64xb64_in1k-448px.py
+ Converted From:
+ Code: https://github.com/rwightman/pytorch-image-models
+ Weights: https://huggingface.co/timm/vit_base_patch32_clip_448.laion2b_ft_in12k_in1k
+ - Name: vit-base-p16_clip-openai-pre_3rdparty_in1k-384px
+ Metadata:
+ FLOPs: 49370078208
+ Parameters: 86568424
+ Training Data:
+ - OpenAI
+ - ImageNet-1k
+ In Collection: CLIP
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 86.25
+ Top 5 Accuracy: 97.9
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/clip/clip-vit-base-p16_openai-pre_3rdparty_in1k-384px_20221220-eb012e87.pth
+ Config: configs/clip/vit-base-p16_pt-64xb64_in1k-384px.py
+ Converted From:
+ Code: https://github.com/rwightman/pytorch-image-models
+ Weights: https://huggingface.co/timm/vit_base_patch16_clip_384.openai_ft_in1k
+ - Name: vit-base-p16_clip-laion2b-pre_3rdparty_in1k-384px
+ Metadata:
+ FLOPs: 49370078208
+ Parameters: 86568424
+ Training Data:
+ - LAION-2B
+ - ImageNet-1k
+ In Collection: CLIP
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 86.52
+ Top 5 Accuracy: 97.97
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/clip/clip-vit-base-p16_laion2b-pre_3rdparty_in1k-384px_20221220-558ed826.pth
+ Config: configs/clip/vit-base-p16_pt-64xb64_in1k-384px.py
+ Converted From:
+ Code: https://github.com/rwightman/pytorch-image-models
+ Weights: https://huggingface.co/timm/vit_base_patch16_clip_384.laion2b_ft_in1k
+ - Name: vit-base-p16_clip-openai-in12k-pre_3rdparty_in1k-384px
+ Metadata:
+ FLOPs: 49370078208
+ Parameters: 86568424
+ Training Data:
+ - OpenAI
+ - ImageNet-12k
+ - ImageNet-1k
+ In Collection: CLIP
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 86.87
+ Top 5 Accuracy: 98.05
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/clip/clip-vit-base-p16_openai-in12k-pre_3rdparty_in1k-384px_20221220-8df86b74.pth
+ Config: configs/clip/vit-base-p16_pt-64xb64_in1k-384px.py
+ Converted From:
+ Code: https://github.com/rwightman/pytorch-image-models
+ Weights: https://huggingface.co/timm/vit_base_patch16_clip_384.openai_ft_in12k_in1k
+ - Name: vit-base-p16_clip-laion2b-in12k-pre_3rdparty_in1k-384px
+ Metadata:
+ FLOPs: 49370078208
+ Parameters: 86568424
+ Training Data:
+ - LAION-2B
+ - ImageNet-12k
+ - ImageNet-1k
+ In Collection: CLIP
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 87.17
+ Top 5 Accuracy: 98.02
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/clip/clip-vit-base-p16_laion2b-in12k-pre_3rdparty_in1k-384px_20221220-84ed0cc0.pth
+ Config: configs/clip/vit-base-p16_pt-64xb64_in1k-384px.py
+ Converted From:
+ Code: https://github.com/rwightman/pytorch-image-models
+ Weights: https://huggingface.co/timm/vit_base_patch16_clip_384.laion2b_ft_in12k_in1k
+ - Name: vit-large-p14_clip-openai-pre_3rdparty
+ Metadata:
+ FLOPs: 59696580608
+ Parameters: 303302656
+ Training Data:
+ - OpenAI
+ In Collection: CLIP
+ Weights: https://download.openmmlab.com/mmclassification/v0/clip/vit-large-p14_clip-openai-pre_3rdparty_20230517-95e2af0b.pth
+ Config: configs/clip/vit-large-p14_headless.py
+ Converted From:
+ Code: https://github.com/mlfoundations/open_clip
+ Weights: https://openaipublic.azureedge.net/clip/models/b8cca3fd41ae0c99ba7e8951adf17d267cdb84cd88be6f7c2e0eca1737a03836/ViT-L-14.pt
diff --git a/configs/clip/vit-base-p16_pt-64xb64_in1k-384px.py b/configs/clip/vit-base-p16_pt-64xb64_in1k-384px.py
new file mode 100644
index 0000000000000000000000000000000000000000..14046ce3e40cce46944ccc0ddef6c884c38d9c89
--- /dev/null
+++ b/configs/clip/vit-base-p16_pt-64xb64_in1k-384px.py
@@ -0,0 +1,40 @@
+_base_ = [
+ '../_base_/models/vit-base-p16.py',
+ '../_base_/datasets/imagenet_bs64_pil_resize.py',
+ '../_base_/schedules/imagenet_bs4096_AdamW.py',
+ '../_base_/default_runtime.py'
+]
+
+# model setting
+model = dict(backbone=dict(pre_norm=True))
+
+# data settings
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='RandomResizedCrop',
+ scale=384,
+ backend='pillow',
+ interpolation='bicubic'),
+ dict(type='RandomFlip', prob=0.5, direction='horizontal'),
+ dict(type='PackInputs'),
+]
+
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='ResizeEdge',
+ scale=384,
+ edge='short',
+ backend='pillow',
+ interpolation='bicubic'),
+ dict(type='CenterCrop', crop_size=384),
+ dict(type='PackInputs'),
+]
+
+train_dataloader = dict(dataset=dict(pipeline=train_pipeline))
+val_dataloader = dict(dataset=dict(pipeline=test_pipeline))
+test_dataloader = dict(dataset=dict(pipeline=test_pipeline))
+
+# schedule setting
+optim_wrapper = dict(clip_grad=dict(max_norm=1.0))
diff --git a/configs/clip/vit-base-p16_pt-64xb64_in1k-448px.py b/configs/clip/vit-base-p16_pt-64xb64_in1k-448px.py
new file mode 100644
index 0000000000000000000000000000000000000000..02af585753074f3a831188a01085917eb04dad4b
--- /dev/null
+++ b/configs/clip/vit-base-p16_pt-64xb64_in1k-448px.py
@@ -0,0 +1,40 @@
+_base_ = [
+ '../_base_/models/vit-base-p16.py',
+ '../_base_/datasets/imagenet_bs64_pil_resize.py',
+ '../_base_/schedules/imagenet_bs4096_AdamW.py',
+ '../_base_/default_runtime.py'
+]
+
+# model setting
+model = dict(backbone=dict(pre_norm=True))
+
+# data settings
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='RandomResizedCrop',
+ scale=448,
+ backend='pillow',
+ interpolation='bicubic'),
+ dict(type='RandomFlip', prob=0.5, direction='horizontal'),
+ dict(type='PackInputs'),
+]
+
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='ResizeEdge',
+ scale=448,
+ edge='short',
+ backend='pillow',
+ interpolation='bicubic'),
+ dict(type='CenterCrop', crop_size=448),
+ dict(type='PackInputs'),
+]
+
+train_dataloader = dict(dataset=dict(pipeline=train_pipeline))
+val_dataloader = dict(dataset=dict(pipeline=test_pipeline))
+test_dataloader = dict(dataset=dict(pipeline=test_pipeline))
+
+# schedule setting
+optim_wrapper = dict(clip_grad=dict(max_norm=1.0))
diff --git a/configs/clip/vit-base-p16_pt-64xb64_in1k.py b/configs/clip/vit-base-p16_pt-64xb64_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..cd018bac622744bdcf6cd50821612a9148c4a85d
--- /dev/null
+++ b/configs/clip/vit-base-p16_pt-64xb64_in1k.py
@@ -0,0 +1,40 @@
+_base_ = [
+ '../_base_/models/vit-base-p16.py',
+ '../_base_/datasets/imagenet_bs64_pil_resize.py',
+ '../_base_/schedules/imagenet_bs4096_AdamW.py',
+ '../_base_/default_runtime.py'
+]
+
+# model setting
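+# CLIP ViTs apply an extra LayerNorm to the patch embeddings before the
+# transformer encoder (`ln_pre` in the original implementation), which is what
+# `pre_norm=True` enables on top of the base ViT-B/16 model.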
+model = dict(backbone=dict(pre_norm=True))
+
+# data settings
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='RandomResizedCrop',
+ scale=224,
+ backend='pillow',
+ interpolation='bicubic'),
+ dict(type='RandomFlip', prob=0.5, direction='horizontal'),
+ dict(type='PackInputs'),
+]
+
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='ResizeEdge',
+ scale=224,
+ edge='short',
+ backend='pillow',
+ interpolation='bicubic'),
+ dict(type='CenterCrop', crop_size=224),
+ dict(type='PackInputs'),
+]
+
+train_dataloader = dict(dataset=dict(pipeline=train_pipeline))
+val_dataloader = dict(dataset=dict(pipeline=test_pipeline))
+test_dataloader = dict(dataset=dict(pipeline=test_pipeline))
+
+# schedule setting
+optim_wrapper = dict(clip_grad=dict(max_norm=1.0))
diff --git a/configs/clip/vit-base-p32_pt-64xb64_in1k-384px.py b/configs/clip/vit-base-p32_pt-64xb64_in1k-384px.py
new file mode 100644
index 0000000000000000000000000000000000000000..d1acf78ab6bf335cc0e3cd1012fbe7773336c61e
--- /dev/null
+++ b/configs/clip/vit-base-p32_pt-64xb64_in1k-384px.py
@@ -0,0 +1,40 @@
+_base_ = [
+ '../_base_/models/vit-base-p32.py',
+ '../_base_/datasets/imagenet_bs64_pil_resize.py',
+ '../_base_/schedules/imagenet_bs4096_AdamW.py',
+ '../_base_/default_runtime.py'
+]
+
+# model setting
+model = dict(backbone=dict(pre_norm=True))
+
+# data settings
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='RandomResizedCrop',
+ scale=384,
+ backend='pillow',
+ interpolation='bicubic'),
+ dict(type='RandomFlip', prob=0.5, direction='horizontal'),
+ dict(type='PackInputs'),
+]
+
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='ResizeEdge',
+ scale=384,
+ edge='short',
+ backend='pillow',
+ interpolation='bicubic'),
+ dict(type='CenterCrop', crop_size=384),
+ dict(type='PackInputs'),
+]
+
+train_dataloader = dict(dataset=dict(pipeline=train_pipeline))
+val_dataloader = dict(dataset=dict(pipeline=test_pipeline))
+test_dataloader = dict(dataset=dict(pipeline=test_pipeline))
+
+# schedule setting
+optim_wrapper = dict(clip_grad=dict(max_norm=1.0))
diff --git a/configs/clip/vit-base-p32_pt-64xb64_in1k-448px.py b/configs/clip/vit-base-p32_pt-64xb64_in1k-448px.py
new file mode 100644
index 0000000000000000000000000000000000000000..0f50391f15bb1dc60b94d5ef163f4e88e3b4e509
--- /dev/null
+++ b/configs/clip/vit-base-p32_pt-64xb64_in1k-448px.py
@@ -0,0 +1,40 @@
+_base_ = [
+ '../_base_/models/vit-base-p32.py',
+ '../_base_/datasets/imagenet_bs64_pil_resize.py',
+ '../_base_/schedules/imagenet_bs4096_AdamW.py',
+ '../_base_/default_runtime.py'
+]
+
+# model setting
+model = dict(backbone=dict(pre_norm=True))
+
+# data settings
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='RandomResizedCrop',
+ scale=448,
+ backend='pillow',
+ interpolation='bicubic'),
+ dict(type='RandomFlip', prob=0.5, direction='horizontal'),
+ dict(type='PackInputs'),
+]
+
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='ResizeEdge',
+ scale=448,
+ edge='short',
+ backend='pillow',
+ interpolation='bicubic'),
+ dict(type='CenterCrop', crop_size=448),
+ dict(type='PackInputs'),
+]
+
+train_dataloader = dict(dataset=dict(pipeline=train_pipeline))
+val_dataloader = dict(dataset=dict(pipeline=test_pipeline))
+test_dataloader = dict(dataset=dict(pipeline=test_pipeline))
+
+# schedule setting
+optim_wrapper = dict(clip_grad=dict(max_norm=1.0))
diff --git a/configs/clip/vit-base-p32_pt-64xb64_in1k.py b/configs/clip/vit-base-p32_pt-64xb64_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..abbb50089edb9057504e7571bd29fddaa1c53dc9
--- /dev/null
+++ b/configs/clip/vit-base-p32_pt-64xb64_in1k.py
@@ -0,0 +1,40 @@
+_base_ = [
+ '../_base_/models/vit-base-p32.py',
+ '../_base_/datasets/imagenet_bs64_pil_resize.py',
+ '../_base_/schedules/imagenet_bs4096_AdamW.py',
+ '../_base_/default_runtime.py'
+]
+
+# model setting
+model = dict(backbone=dict(pre_norm=True))
+
+# data settings
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='RandomResizedCrop',
+ scale=224,
+ backend='pillow',
+ interpolation='bicubic'),
+ dict(type='RandomFlip', prob=0.5, direction='horizontal'),
+ dict(type='PackInputs'),
+]
+
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='ResizeEdge',
+ scale=224,
+ edge='short',
+ backend='pillow',
+ interpolation='bicubic'),
+ dict(type='CenterCrop', crop_size=224),
+ dict(type='PackInputs'),
+]
+
+train_dataloader = dict(dataset=dict(pipeline=train_pipeline))
+val_dataloader = dict(dataset=dict(pipeline=test_pipeline))
+test_dataloader = dict(dataset=dict(pipeline=test_pipeline))
+
+# schedule setting
+optim_wrapper = dict(clip_grad=dict(max_norm=1.0))
diff --git a/configs/clip/vit-large-p14_headless.py b/configs/clip/vit-large-p14_headless.py
new file mode 100644
index 0000000000000000000000000000000000000000..c9b965d4f0edc4794b05a3ea6a917a0d350a27f3
--- /dev/null
+++ b/configs/clip/vit-large-p14_headless.py
@@ -0,0 +1,34 @@
+_base_ = ['../_base_/default_runtime.py']
+
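+# Headless setting: only the vision backbone is defined (no classification head)
+# and `test_evaluator` is None, so this config is intended for extracting image
+# features from converted CLIP ViT-L weights rather than for evaluation.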
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='VisionTransformer',
+ arch='l',
+ img_size=224,
+ patch_size=16,
+ drop_rate=0.1,
+ pre_norm=True,
+ ),
+)
+
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='ResizeEdge', scale=256, edge='short', backend='pillow'),
+ dict(type='CenterCrop', crop_size=224),
+ dict(type='PackInputs'),
+]
+
+test_dataloader = dict(
+ batch_size=64,
+ num_workers=5,
+ dataset=dict(
+ type='ImageNet',
+ data_root='data/imagenet',
+ ann_file='meta/val.txt',
+ data_prefix='val',
+ pipeline=test_pipeline),
+ sampler=dict(type='DefaultSampler', shuffle=False),
+)
+test_evaluator = None
diff --git a/configs/conformer/README.md b/configs/conformer/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..04b5d4770b22c346a149dfc0bf7c1dfc2713a2a6
--- /dev/null
+++ b/configs/conformer/README.md
@@ -0,0 +1,84 @@
+# Conformer
+
+> [Conformer: Local Features Coupling Global Representations for Visual Recognition](https://arxiv.org/abs/2105.03889)
+
+
+
+## Abstract
+
+Within Convolutional Neural Network (CNN), the convolution operations are good at extracting local features but experience difficulty to capture global representations. Within visual transformer, the cascaded self-attention modules can capture long-distance feature dependencies but unfortunately deteriorate local feature details. In this paper, we propose a hybrid network structure, termed Conformer, to take advantage of convolutional operations and self-attention mechanisms for enhanced representation learning. Conformer roots in the Feature Coupling Unit (FCU), which fuses local features and global representations under different resolutions in an interactive fashion. Conformer adopts a concurrent structure so that local features and global representations are retained to the maximum extent. Experiments show that Conformer, under the comparable parameter complexity, outperforms the visual transformer (DeiT-B) by 2.3% on ImageNet. On MSCOCO, it outperforms ResNet-101 by 3.7% and 3.6% mAPs for object detection and instance segmentation, respectively, demonstrating the great potential to be a general backbone network.
+
+
+
+
+
+## How to use it?
+
+
+
+**Predict image**
+
+```python
+from mmpretrain import inference_model
+
+predict = inference_model('conformer-tiny-p16_3rdparty_in1k', 'demo/bird.JPEG')
+print(predict['pred_class'])
+print(predict['pred_score'])
+```
+
+**Use the model**
+
+```python
+import torch
+from mmpretrain import get_model
+
+model = get_model('conformer-tiny-p16_3rdparty_in1k', pretrained=True)
+inputs = torch.rand(1, 3, 224, 224)
+out = model(inputs)
+print(type(out))
+# To extract features.
+feats = model.extract_feat(inputs)
+print(type(feats))
+```
+
+**Train/Test Command**
+
+Prepare your dataset according to the [docs](https://mmpretrain.readthedocs.io/en/latest/user_guides/dataset_prepare.html#prepare-dataset).
+
+Train:
+
+```shell
+python tools/train.py configs/conformer/conformer-small-p32_8xb128_in1k.py
+```
+
+Test:
+
+```shell
+python tools/test.py configs/conformer/conformer-tiny-p16_8xb128_in1k.py https://download.openmmlab.com/mmclassification/v0/conformer/conformer-tiny-p16_3rdparty_8xb128_in1k_20211206-f6860372.pth
+```
+
+
+
+## Models and results
+
+### Image Classification on ImageNet-1k
+
+| Model | Pretrain | Params (M) | Flops (G) | Top-1 (%) | Top-5 (%) | Config | Download |
+| :------------------------------------ | :----------: | :--------: | :-------: | :-------: | :-------: | :------------------------------------------: | :--------------------------------------------------------------------: |
+| `conformer-tiny-p16_3rdparty_in1k`\* | From scratch | 23.52 | 4.90 | 81.31 | 95.60 | [config](conformer-tiny-p16_8xb128_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/conformer/conformer-tiny-p16_3rdparty_8xb128_in1k_20211206-f6860372.pth) |
+| `conformer-small-p16_3rdparty_in1k`\* | From scratch | 37.67 | 10.31 | 83.32 | 96.46 | [config](conformer-small-p16_8xb128_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/conformer/conformer-small-p16_3rdparty_8xb128_in1k_20211206-3065dcf5.pth) |
+| `conformer-small-p32_8xb128_in1k` | From scratch | 38.85 | 7.09 | 81.96 | 96.02 | [config](conformer-small-p32_8xb128_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/conformer/conformer-small-p32_8xb128_in1k_20211206-947a0816.pth) |
+| `conformer-base-p16_3rdparty_in1k`\* | From scratch | 83.29 | 22.89 | 83.82 | 96.59 | [config](conformer-base-p16_8xb128_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/conformer/conformer-base-p16_3rdparty_8xb128_in1k_20211206-bfdf8637.pth) |
+
+*Models with * are converted from the [official repo](https://github.com/pengzhiliang/Conformer/blob/main/models.py#L89). The config files of these models are only for inference. We haven't reproduced the training results.*
+
+## Citation
+
+```bibtex
+@article{peng2021conformer,
+ title={Conformer: Local Features Coupling Global Representations for Visual Recognition},
+ author={Zhiliang Peng and Wei Huang and Shanzhi Gu and Lingxi Xie and Yaowei Wang and Jianbin Jiao and Qixiang Ye},
+ journal={arXiv preprint arXiv:2105.03889},
+ year={2021},
+}
+```
diff --git a/configs/conformer/conformer-base-p16_8xb128_in1k.py b/configs/conformer/conformer-base-p16_8xb128_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..a44f56f3ac3213c616a6e960ce2476466eb65bbd
--- /dev/null
+++ b/configs/conformer/conformer-base-p16_8xb128_in1k.py
@@ -0,0 +1,8 @@
+_base_ = [
+ '../_base_/models/conformer/base-p16.py',
+ '../_base_/datasets/imagenet_bs64_swin_224.py',
+ '../_base_/schedules/imagenet_bs1024_adamw_conformer.py',
+ '../_base_/default_runtime.py'
+]
+
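+# Raise the per-GPU batch size from the base dataset setting (64) to 128,
+# matching the 8xb128 setup in the config name.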
+train_dataloader = dict(batch_size=128)
diff --git a/configs/conformer/conformer-small-p16_8xb128_in1k.py b/configs/conformer/conformer-small-p16_8xb128_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..a937f4f9e60c3987a6ff3d2b7320a0dd49855cbc
--- /dev/null
+++ b/configs/conformer/conformer-small-p16_8xb128_in1k.py
@@ -0,0 +1,8 @@
+_base_ = [
+ '../_base_/models/conformer/small-p16.py',
+ '../_base_/datasets/imagenet_bs64_swin_224.py',
+ '../_base_/schedules/imagenet_bs1024_adamw_conformer.py',
+ '../_base_/default_runtime.py'
+]
+
+train_dataloader = dict(batch_size=128)
diff --git a/configs/conformer/conformer-small-p32_8xb128_in1k.py b/configs/conformer/conformer-small-p32_8xb128_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..0b07ce2ce3fba146675b7a8453cc581f2a011db1
--- /dev/null
+++ b/configs/conformer/conformer-small-p32_8xb128_in1k.py
@@ -0,0 +1,8 @@
+_base_ = [
+ '../_base_/models/conformer/small-p32.py',
+ '../_base_/datasets/imagenet_bs64_swin_224.py',
+ '../_base_/schedules/imagenet_bs1024_adamw_conformer.py',
+ '../_base_/default_runtime.py'
+]
+
+train_dataloader = dict(batch_size=128)
diff --git a/configs/conformer/conformer-tiny-p16_8xb128_in1k.py b/configs/conformer/conformer-tiny-p16_8xb128_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..f88c6c3b0da3c50e0b3ccb2454b200dfbaf7c4c7
--- /dev/null
+++ b/configs/conformer/conformer-tiny-p16_8xb128_in1k.py
@@ -0,0 +1,8 @@
+_base_ = [
+ '../_base_/models/conformer/tiny-p16.py',
+ '../_base_/datasets/imagenet_bs64_swin_224.py',
+ '../_base_/schedules/imagenet_bs1024_adamw_conformer.py',
+ '../_base_/default_runtime.py'
+]
+
+train_dataloader = dict(batch_size=128)
diff --git a/configs/conformer/metafile.yml b/configs/conformer/metafile.yml
new file mode 100644
index 0000000000000000000000000000000000000000..c0821bad059c32db978f02c4935a41ec0c054c16
--- /dev/null
+++ b/configs/conformer/metafile.yml
@@ -0,0 +1,78 @@
+Collections:
+ - Name: Conformer
+ Metadata:
+ Training Data: ImageNet-1k
+ Architecture:
+ - Layer Normalization
+ - Scaled Dot-Product Attention
+ - Dropout
+ Paper:
+ URL: https://arxiv.org/abs/2105.03889
+ Title: "Conformer: Local Features Coupling Global Representations for Visual Recognition"
+ README: configs/conformer/README.md
+ Code:
+ URL: https://github.com/open-mmlab/mmpretrain/blob/v0.19.0/mmcls/models/backbones/conformer.py
+ Version: v0.19.0
+
+Models:
+ - Name: conformer-tiny-p16_3rdparty_in1k
+ In Collection: Conformer
+ Config: configs/conformer/conformer-tiny-p16_8xb128_in1k.py
+ Metadata:
+ FLOPs: 4899611328
+ Parameters: 23524704
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 81.31
+ Top 5 Accuracy: 95.60
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/conformer/conformer-tiny-p16_3rdparty_8xb128_in1k_20211206-f6860372.pth
+ Converted From:
+ Weights: https://drive.google.com/file/d/19SxGhKcWOR5oQSxNUWUM2MGYiaWMrF1z/view?usp=sharing
+ Code: https://github.com/pengzhiliang/Conformer/blob/main/models.py#L65
+ - Name: conformer-small-p16_3rdparty_in1k
+ In Collection: Conformer
+ Config: configs/conformer/conformer-small-p16_8xb128_in1k.py
+ Metadata:
+ FLOPs: 10311309312
+ Parameters: 37673424
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 83.32
+ Top 5 Accuracy: 96.46
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/conformer/conformer-small-p16_3rdparty_8xb128_in1k_20211206-3065dcf5.pth
+ Converted From:
+ Weights: https://drive.google.com/file/d/1mpOlbLaVxOfEwV4-ha78j_1Ebqzj2B83/view?usp=sharing
+ Code: https://github.com/pengzhiliang/Conformer/blob/main/models.py#L73
+ - Name: conformer-small-p32_8xb128_in1k
+ In Collection: Conformer
+ Config: configs/conformer/conformer-small-p32_8xb128_in1k.py
+ Metadata:
+ FLOPs: 7087281792
+ Parameters: 38853072
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 81.96
+ Top 5 Accuracy: 96.02
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/conformer/conformer-small-p32_8xb128_in1k_20211206-947a0816.pth
+ - Name: conformer-base-p16_3rdparty_in1k
+ In Collection: Conformer
+ Config: configs/conformer/conformer-base-p16_8xb128_in1k.py
+ Metadata:
+ FLOPs: 22892078080
+ Parameters: 83289136
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 83.82
+ Top 5 Accuracy: 96.59
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/conformer/conformer-base-p16_3rdparty_8xb128_in1k_20211206-bfdf8637.pth
+ Converted From:
+ Weights: https://drive.google.com/file/d/1oeQ9LSOGKEUaYGu7WTlUGl3KDsQIi0MA/view?usp=sharing
+ Code: https://github.com/pengzhiliang/Conformer/blob/main/models.py#L89
diff --git a/configs/convmixer/README.md b/configs/convmixer/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..a87d27ffb8ec0dd6a6182d99133a227b0b29945b
--- /dev/null
+++ b/configs/convmixer/README.md
@@ -0,0 +1,79 @@
+# ConvMixer
+
+> [Patches Are All You Need?](https://arxiv.org/abs/2201.09792)
+
+
+
+## Abstract
+
+Although convolutional networks have been the dominant architecture for vision tasks for many years, recent experiments have shown that Transformer-based models, most notably the Vision Transformer (ViT), may exceed their performance in some settings. However, due to the quadratic runtime of the self-attention layers in Transformers, ViTs require the use of patch embeddings, which group together small regions of the image into single input features, in order to be applied to larger image sizes. This raises a question: Is the performance of ViTs due to the inherently-more-powerful Transformer architecture, or is it at least partly due to using patches as the input representation? In this paper, we present some evidence for the latter: specifically, we propose the ConvMixer, an extremely simple model that is similar in spirit to the ViT and the even-more-basic MLP-Mixer in that it operates directly on patches as input, separates the mixing of spatial and channel dimensions, and maintains equal size and resolution throughout the network. In contrast, however, the ConvMixer uses only standard convolutions to achieve the mixing steps. Despite its simplicity, we show that the ConvMixer outperforms the ViT, MLP-Mixer, and some of their variants for similar parameter counts and data set sizes, in addition to outperforming classical vision models such as the ResNet.
+
+
+
+
+
+## How to use it?
+
+
+
+**Predict image**
+
+```python
+from mmpretrain import inference_model
+
+predict = inference_model('convmixer-768-32_3rdparty_in1k', 'demo/bird.JPEG')
+print(predict['pred_class'])
+print(predict['pred_score'])
+```
+
+**Use the model**
+
+```python
+import torch
+from mmpretrain import get_model
+
+model = get_model('convmixer-768-32_3rdparty_in1k', pretrained=True)
+inputs = torch.rand(1, 3, 224, 224)
+out = model(inputs)
+print(type(out))
+# To extract features.
+feats = model.extract_feat(inputs)
+print(type(feats))
+```
+
+**Test Command**
+
+Prepare your dataset according to the [docs](https://mmpretrain.readthedocs.io/en/latest/user_guides/dataset_prepare.html#prepare-dataset).
+
+Test:
+
+```shell
+python tools/test.py configs/convmixer/convmixer-768-32_10xb64_in1k.py https://download.openmmlab.com/mmclassification/v0/convmixer/convmixer-768-32_3rdparty_10xb64_in1k_20220323-bca1f7b8.pth
+```
+
+
+
+## Models and results
+
+### Image Classification on ImageNet-1k
+
+| Model | Pretrain | Params (M) | Flops (G) | Top-1 (%) | Top-5 (%) | Config | Download |
+| :---------------------------------- | :----------: | :--------: | :-------: | :-------: | :-------: | :----------------------------------------: | :------------------------------------------------------------------------: |
+| `convmixer-768-32_3rdparty_in1k`\* | From scratch | 21.11 | 19.62 | 80.16 | 95.08 | [config](convmixer-768-32_10xb64_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/convmixer/convmixer-768-32_3rdparty_10xb64_in1k_20220323-bca1f7b8.pth) |
+| `convmixer-1024-20_3rdparty_in1k`\* | From scratch | 24.38 | 5.55 | 76.94 | 93.36 | [config](convmixer-1024-20_10xb64_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/convmixer/convmixer-1024-20_3rdparty_10xb64_in1k_20220323-48f8aeba.pth) |
+| `convmixer-1536-20_3rdparty_in1k`\* | From scratch | 51.63 | 48.71 | 81.37 | 95.61 | [config](convmixer-1536-20_10xb64_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/convmixer/convmixer-1536_20_3rdparty_10xb64_in1k_20220323-ea5786f3.pth) |
+
+*Models with * are converted from the [official repo](https://github.com/locuslab/convmixer). The config files of these models are only for inference. We haven't reproduced the training results.*
+
+## Citation
+
+```bibtex
+@misc{trockman2022patches,
+ title={Patches Are All You Need?},
+ author={Asher Trockman and J. Zico Kolter},
+ year={2022},
+ eprint={2201.09792},
+ archivePrefix={arXiv},
+ primaryClass={cs.CV}
+}
+```
diff --git a/configs/convmixer/convmixer-1024-20_10xb64_in1k.py b/configs/convmixer/convmixer-1024-20_10xb64_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..0dbc664261e2244cb35a779211c45b5b854d4cc5
--- /dev/null
+++ b/configs/convmixer/convmixer-1024-20_10xb64_in1k.py
@@ -0,0 +1,39 @@
+_base_ = [
+ '../_base_/models/convmixer/convmixer-1024-20.py',
+ '../_base_/datasets/imagenet_bs64_convmixer_224.py',
+ '../_base_/schedules/imagenet_bs1024_adamw_swin.py',
+ '../_base_/default_runtime.py',
+]
+
+# schedule setting
+optim_wrapper = dict(
+ optimizer=dict(lr=0.01),
+ clip_grad=dict(max_norm=5.0),
+)
+
+param_scheduler = [
+ # warm up learning rate scheduler
+ dict(
+ type='LinearLR',
+ start_factor=1e-3,
+ by_epoch=True,
+ begin=0,
+ end=20,
+ # update by iter
+ convert_to_iter_based=True),
+ # main learning rate scheduler
+ dict(
+ type='CosineAnnealingLR',
+ T_max=130,
+ eta_min=1e-5,
+ by_epoch=True,
+ begin=20,
+ end=150)
+]
+
+train_cfg = dict(by_epoch=True, max_epochs=150)
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR
+# based on the actual training batch size.
+# base_batch_size = (10 GPUs) x (64 samples per GPU)
+auto_scale_lr = dict(base_batch_size=640)
diff --git a/configs/convmixer/convmixer-1536-20_10xb64_in1k.py b/configs/convmixer/convmixer-1536-20_10xb64_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..3c8cc95c20312311ee06cee911dc186944de5b7f
--- /dev/null
+++ b/configs/convmixer/convmixer-1536-20_10xb64_in1k.py
@@ -0,0 +1,39 @@
+_base_ = [
+ '../_base_/models/convmixer/convmixer-1536-20.py',
+ '../_base_/datasets/imagenet_bs64_convmixer_224.py',
+ '../_base_/schedules/imagenet_bs1024_adamw_swin.py',
+ '../_base_/default_runtime.py',
+]
+
+# schedule setting
+optim_wrapper = dict(
+ optimizer=dict(lr=0.01),
+ clip_grad=dict(max_norm=5.0),
+)
+
+param_scheduler = [
+ # warm up learning rate scheduler
+ dict(
+ type='LinearLR',
+ start_factor=1e-3,
+ by_epoch=True,
+ begin=0,
+ end=20,
+ # update by iter
+ convert_to_iter_based=True),
+ # main learning rate scheduler
+ dict(
+ type='CosineAnnealingLR',
+ T_max=130,
+ eta_min=1e-5,
+ by_epoch=True,
+ begin=20,
+ end=150)
+]
+
+train_cfg = dict(by_epoch=True, max_epochs=150)
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR
+# based on the actual training batch size.
+# base_batch_size = (10 GPUs) x (64 samples per GPU)
+auto_scale_lr = dict(base_batch_size=640)
diff --git a/configs/convmixer/convmixer-768-32_10xb64_in1k.py b/configs/convmixer/convmixer-768-32_10xb64_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..d872d4429134ef8c88ea87da3c93b6532472423e
--- /dev/null
+++ b/configs/convmixer/convmixer-768-32_10xb64_in1k.py
@@ -0,0 +1,19 @@
+_base_ = [
+ '../_base_/models/convmixer/convmixer-768-32.py',
+ '../_base_/datasets/imagenet_bs64_convmixer_224.py',
+ '../_base_/schedules/imagenet_bs1024_adamw_swin.py',
+ '../_base_/default_runtime.py',
+]
+
+# schedule setting
+optim_wrapper = dict(
+ optimizer=dict(lr=0.01),
+ clip_grad=dict(max_norm=5.0),
+)
+
+train_cfg = dict(by_epoch=True, max_epochs=300)
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR
+# based on the actual training batch size.
+# base_batch_size = (10 GPUs) x (64 samples per GPU)
+auto_scale_lr = dict(base_batch_size=640)
diff --git a/configs/convmixer/metafile.yml b/configs/convmixer/metafile.yml
new file mode 100644
index 0000000000000000000000000000000000000000..f9dcdc7cc71ddc72791ab47666c0a35d30a9f349
--- /dev/null
+++ b/configs/convmixer/metafile.yml
@@ -0,0 +1,61 @@
+Collections:
+ - Name: ConvMixer
+ Metadata:
+ Training Data: ImageNet-1k
+ Architecture:
+ - 1x1 Convolution
+ - LayerScale
+ Paper:
+ URL: https://arxiv.org/abs/2201.09792
+ Title: Patches Are All You Need?
+ README: configs/convmixer/README.md
+
+Models:
+ - Name: convmixer-768-32_3rdparty_in1k
+ Metadata:
+ FLOPs: 19623051264
+ Parameters: 21110248
+ In Collection: ConvMixer
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 80.16
+ Top 5 Accuracy: 95.08
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/convmixer/convmixer-768-32_3rdparty_10xb64_in1k_20220323-bca1f7b8.pth
+ Config: configs/convmixer/convmixer-768-32_10xb64_in1k.py
+ Converted From:
+ Weights: https://github.com/tmp-iclr/convmixer/releases/download/v1.0/convmixer_768_32_ks7_p7_relu.pth.tar
+ Code: https://github.com/locuslab/convmixer
+ - Name: convmixer-1024-20_3rdparty_in1k
+ Metadata:
+ FLOPs: 5550112768
+ Parameters: 24383464
+ In Collection: ConvMixer
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 76.94
+ Top 5 Accuracy: 93.36
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/convmixer/convmixer-1024-20_3rdparty_10xb64_in1k_20220323-48f8aeba.pth
+ Config: configs/convmixer/convmixer-1024-20_10xb64_in1k.py
+ Converted From:
+ Weights: https://github.com/tmp-iclr/convmixer/releases/download/v1.0/convmixer_1024_20_ks9_p14.pth.tar
+ Code: https://github.com/locuslab/convmixer
+ - Name: convmixer-1536-20_3rdparty_in1k
+ Metadata:
+ FLOPs: 48713170944
+ Parameters: 51625960
+ In Collection: ConvMixer
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 81.37
+ Top 5 Accuracy: 95.61
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/convmixer/convmixer-1536_20_3rdparty_10xb64_in1k_20220323-ea5786f3.pth
+ Config: configs/convmixer/convmixer-1536-20_10xb64_in1k.py
+ Converted From:
+ Weights: https://github.com/tmp-iclr/convmixer/releases/download/v1.0/convmixer_1536_20_ks9_p7.pth.tar
+ Code: https://github.com/locuslab/convmixer
diff --git a/configs/convnext/README.md b/configs/convnext/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..2e6e14c2f2e65af68c1f8177bdec91f70a0b3149
--- /dev/null
+++ b/configs/convnext/README.md
@@ -0,0 +1,123 @@
+# ConvNeXt
+
+> [A ConvNet for the 2020s](https://arxiv.org/abs/2201.03545v1)
+
+
+
+## Introduction
+
+**ConvNeXt**, initially described in [A ConvNet for the 2020s](https://arxiv.org/abs/2201.03545v1), is a pure convolutional model (ConvNet) inspired by the design of Vision Transformers. ConvNeXt adopts a pyramid structure and achieves competitive performance on various vision tasks while keeping the simplicity and efficiency of standard ConvNets.
+
+
+
+
+
+## Abstract
+
+
+
+
+
+The "Roaring 20s" of visual recognition began with the introduction of Vision Transformers (ViTs), which quickly superseded ConvNets as the state-of-the-art image classification model. A vanilla ViT, on the other hand, faces difficulties when applied to general computer vision tasks such as object detection and semantic segmentation. It is the hierarchical Transformers (e.g., Swin Transformers) that reintroduced several ConvNet priors, making Transformers practically viable as a generic vision backbone and demonstrating remarkable performance on a wide variety of vision tasks. However, the effectiveness of such hybrid approaches is still largely credited to the intrinsic superiority of Transformers, rather than the inherent inductive biases of convolutions. In this work, we reexamine the design spaces and test the limits of what a pure ConvNet can achieve. We gradually "modernize" a standard ResNet toward the design of a vision Transformer, and discover several key components that contribute to the performance difference along the way. The outcome of this exploration is a family of pure ConvNet models dubbed ConvNeXt. Constructed entirely from standard ConvNet modules, ConvNeXts compete favorably with Transformers in terms of accuracy and scalability, achieving 87.8% ImageNet top-1 accuracy and outperforming Swin Transformers on COCO detection and ADE20K segmentation, while maintaining the simplicity and efficiency of standard ConvNets.
+
+
+
+
+## How to use it?
+
+
+
+**Predict image**
+
+```python
+from mmpretrain import inference_model
+
+predict = inference_model('convnext-tiny_32xb128_in1k', 'demo/bird.JPEG')
+print(predict['pred_class'])
+print(predict['pred_score'])
+```
+
+**Use the model**
+
+```python
+import torch
+from mmpretrain import get_model
+
+model = get_model('convnext-tiny_32xb128_in1k', pretrained=True)
+inputs = torch.rand(1, 3, 224, 224)
+out = model(inputs)
+print(type(out))
+# To extract features.
+feats = model.extract_feat(inputs)
+print(type(feats))
+```
+
+**Train/Test Command**
+
+Prepare your dataset according to the [docs](https://mmpretrain.readthedocs.io/en/latest/user_guides/dataset_prepare.html#prepare-dataset).
+
+Train:
+
+```shell
+python tools/train.py configs/convnext/convnext-tiny_32xb128_in1k.py
+```
+
+Test:
+
+```shell
+python tools/test.py configs/convnext/convnext-tiny_32xb128_in1k.py https://download.openmmlab.com/mmclassification/v0/convnext/convnext-tiny_32xb128_in1k_20221207-998cf3e9.pth
+```
+
+
+
+## Models and results
+
+### Pretrained models
+
+| Model | Params (M) | Flops (G) | Config | Download |
+| :--------------------------------- | :--------: | :-------: | :---------------------------------------: | :--------------------------------------------------------------------------------------------------------: |
+| `convnext-base_3rdparty_in21k`\* | 88.59 | 15.36 | [config](convnext-base_32xb128_in21k.py) | [model](https://download.openmmlab.com/mmclassification/v0/convnext/convnext-base_3rdparty_in21k_20220124-13b83eec.pth) |
+| `convnext-large_3rdparty_in21k`\* | 197.77 | 34.37 | [config](convnext-large_64xb64_in21k.py) | [model](https://download.openmmlab.com/mmclassification/v0/convnext/convnext-large_3rdparty_in21k_20220124-41b5a79f.pth) |
+| `convnext-xlarge_3rdparty_in21k`\* | 350.20 | 60.93 | [config](convnext-xlarge_64xb64_in21k.py) | [model](https://download.openmmlab.com/mmclassification/v0/convnext/convnext-xlarge_3rdparty_in21k_20220124-f909bad7.pth) |
+
+*Models with * are converted from the [official repo](https://github.com/facebookresearch/ConvNeXt). The config files of these models are only for inference. We haven't reproduced the training results.*
+
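+The ImageNet-21k checkpoints above are released without ImageNet-1k results and are mainly useful as initialization weights or feature extractors. A minimal sketch, following the usage pattern shown earlier in this README (assuming the model name from the table above is registered):
+
+```python
+import torch
+from mmpretrain import get_model
+
+# Load the ImageNet-21k pre-trained ConvNeXt-B and extract backbone features.
+model = get_model('convnext-base_3rdparty_in21k', pretrained=True)
+inputs = torch.rand(1, 3, 224, 224)
+feats = model.extract_feat(inputs)
+print(type(feats))
+```
+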
+### Image Classification on ImageNet-1k
+
+| Model | Pretrain | Params (M) | Flops (G) | Top-1 (%) | Top-5 (%) | Config | Download |
+| :------------------------------------------------ | :----------: | :--------: | :-------: | :-------: | :-------: | :--------------------------------------------: | :------------------------------------------------------: |
+| `convnext-tiny_32xb128_in1k` | From scratch | 28.59 | 4.46 | 82.14 | 96.06 | [config](convnext-tiny_32xb128_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/convnext/convnext-tiny_32xb128_in1k_20221207-998cf3e9.pth) \| [log](https://download.openmmlab.com/mmclassification/v0/convnext/convnext-tiny_32xb128_in1k_20221207-998cf3e9.json) |
+| `convnext-tiny_32xb128-noema_in1k` | From scratch | 28.59 | 4.46 | 81.95 | 95.89 | [config](convnext-tiny_32xb128_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/convnext/convnext-tiny_32xb128-noema_in1k_20221208-5d4509c7.pth) \| [log](https://download.openmmlab.com/mmclassification/v0/convnext/convnext-tiny_32xb128_in1k_20221207-998cf3e9.json) |
+| `convnext-tiny_in21k-pre_3rdparty_in1k`\* | ImageNet-21k | 28.59 | 4.46 | 82.90 | 96.62 | [config](convnext-tiny_32xb128_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/convnext/convnext-tiny_in21k-pre_3rdparty_in1k_20221219-7501e534.pth) |
+| `convnext-tiny_in21k-pre_3rdparty_in1k-384px`\* | ImageNet-21k | 28.59 | 13.14 | 84.11 | 97.14 | [config](convnext-tiny_32xb128_in1k-384px.py) | [model](https://download.openmmlab.com/mmclassification/v0/convnext/convnext-tiny_in21k-pre_3rdparty_in1k-384px_20221219-c1182362.pth) |
+| `convnext-small_32xb128_in1k` | From scratch | 50.22 | 8.69 | 83.16 | 96.56 | [config](convnext-small_32xb128_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/convnext/convnext-small_32xb128_in1k_20221207-4ab7052c.pth) \| [log](https://download.openmmlab.com/mmclassification/v0/convnext/convnext-small_32xb128_in1k_20221207-4ab7052c.json) |
+| `convnext-small_32xb128-noema_in1k` | From scratch | 50.22 | 8.69 | 83.21 | 96.48 | [config](convnext-small_32xb128_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/convnext/convnext-small_32xb128-noema_in1k_20221208-4a618995.pth) \| [log](https://download.openmmlab.com/mmclassification/v0/convnext/convnext-small_32xb128_in1k_20221207-4ab7052c.json) |
+| `convnext-small_in21k-pre_3rdparty_in1k`\* | ImageNet-21k | 50.22 | 8.69 | 84.59 | 97.41 | [config](convnext-small_32xb128_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/convnext/convnext-small_in21k-pre_3rdparty_in1k_20221219-aeca4c93.pth) |
+| `convnext-small_in21k-pre_3rdparty_in1k-384px`\* | ImageNet-21k | 50.22 | 25.58 | 85.75 | 97.88 | [config](convnext-small_32xb128_in1k-384px.py) | [model](https://download.openmmlab.com/mmclassification/v0/convnext/convnext-small_in21k-pre_3rdparty_in1k-384px_20221219-96f0bb87.pth) |
+| `convnext-base_32xb128_in1k` | From scratch | 88.59 | 15.36 | 83.66 | 96.74 | [config](convnext-base_32xb128_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/convnext/convnext-base_32xb128_in1k_20221207-fbdb5eb9.pth) \| [log](https://download.openmmlab.com/mmclassification/v0/convnext/convnext-base_32xb128_in1k_20221207-fbdb5eb9.json) |
+| `convnext-base_32xb128-noema_in1k` | From scratch | 88.59 | 15.36 | 83.64 | 96.61 | [config](convnext-base_32xb128_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/convnext/convnext-base_32xb128-noema_in1k_20221208-f8182678.pth) \| [log](https://download.openmmlab.com/mmclassification/v0/convnext/convnext-base_32xb128_in1k_20221207-fbdb5eb9.json) |
+| `convnext-base_3rdparty_in1k`\* | From scratch | 88.59 | 15.36 | 83.85 | 96.74 | [config](convnext-base_32xb128_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/convnext/convnext-base_3rdparty_32xb128_in1k_20220124-d0915162.pth) |
+| `convnext-base_3rdparty-noema_in1k`\* | From scratch | 88.59 | 15.36 | 83.71 | 96.60 | [config](convnext-base_32xb128_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/convnext/convnext-base_3rdparty_32xb128-noema_in1k_20220222-dba4f95f.pth) |
+| `convnext-base_3rdparty_in1k-384px`\* | From scratch | 88.59 | 45.21 | 85.10 | 97.34 | [config](convnext-base_32xb128_in1k-384px.py) | [model](https://download.openmmlab.com/mmclassification/v0/convnext/convnext-base_3rdparty_in1k-384px_20221219-c8f1dc2b.pth) |
+| `convnext-base_in21k-pre_3rdparty_in1k`\* | ImageNet-21k | 88.59 | 15.36 | 85.81 | 97.86 | [config](convnext-base_32xb128_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/convnext/convnext-base_in21k-pre-3rdparty_32xb128_in1k_20220124-eb2d6ada.pth) |
+| `convnext-base_in21k-pre-3rdparty_in1k-384px`\*   | ImageNet-21k | 88.59      | 45.21     | 86.82     | 98.25     | [config](convnext-base_32xb128_in1k-384px.py)  | [model](https://download.openmmlab.com/mmclassification/v0/convnext/convnext-base_in21k-pre-3rdparty_in1k-384px_20221219-4570f792.pth) |
+| `convnext-large_3rdparty_in1k`\* | From scratch | 197.77 | 34.37 | 84.30 | 96.89 | [config](convnext-large_64xb64_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/convnext/convnext-large_3rdparty_64xb64_in1k_20220124-f8a0ded0.pth) |
+| `convnext-large_3rdparty_in1k-384px`\* | From scratch | 197.77 | 101.10 | 85.50 | 97.59 | [config](convnext-large_64xb64_in1k-384px.py) | [model](https://download.openmmlab.com/mmclassification/v0/convnext/convnext-large_3rdparty_in1k-384px_20221219-6dd29d10.pth) |
+| `convnext-large_in21k-pre_3rdparty_in1k`\* | ImageNet-21k | 197.77 | 34.37 | 86.61 | 98.04 | [config](convnext-large_64xb64_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/convnext/convnext-large_in21k-pre-3rdparty_64xb64_in1k_20220124-2412403d.pth) |
+| `convnext-large_in21k-pre-3rdparty_in1k-384px`\* | ImageNet-21k | 197.77 | 101.10 | 87.46 | 98.37 | [config](convnext-large_64xb64_in1k-384px.py) | [model](https://download.openmmlab.com/mmclassification/v0/convnext/convnext-large_in21k-pre-3rdparty_in1k-384px_20221219-6d38dd66.pth) |
+| `convnext-xlarge_in21k-pre_3rdparty_in1k`\* | ImageNet-21k | 350.20 | 60.93 | 86.97 | 98.20 | [config](convnext-xlarge_64xb64_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/convnext/convnext-xlarge_in21k-pre-3rdparty_64xb64_in1k_20220124-76b6863d.pth) |
+| `convnext-xlarge_in21k-pre-3rdparty_in1k-384px`\* | ImageNet-21k | 350.20 | 179.20 | 87.76 | 98.55 | [config](convnext-xlarge_64xb64_in1k-384px.py) | [model](https://download.openmmlab.com/mmclassification/v0/convnext/convnext-xlarge_in21k-pre-3rdparty_in1k-384px_20221219-b161bc14.pth) |
+
+*Models with * are converted from the [official repo](https://github.com/facebookresearch/ConvNeXt). The config files of these models are only for inference. We haven't reproduced the training results.*
+
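+To evaluate one of these checkpoints, pass its config and the weights URL from the table to the standard test script, as in the sketch below (shown here for the `convnext-base_3rdparty_in1k` row; substitute the config and URL of any other row):
+
+```shell
+python tools/test.py configs/convnext/convnext-base_32xb128_in1k.py https://download.openmmlab.com/mmclassification/v0/convnext/convnext-base_3rdparty_32xb128_in1k_20220124-d0915162.pth
+```
+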
+## Citation
+
+```bibtex
+@Article{liu2022convnet,
+ author = {Zhuang Liu and Hanzi Mao and Chao-Yuan Wu and Christoph Feichtenhofer and Trevor Darrell and Saining Xie},
+ title = {A ConvNet for the 2020s},
+ journal = {arXiv preprint arXiv:2201.03545},
+ year = {2022},
+}
+```
diff --git a/configs/convnext/convnext-base_32xb128_in1k-384px.py b/configs/convnext/convnext-base_32xb128_in1k-384px.py
new file mode 100644
index 0000000000000000000000000000000000000000..65546942562ac17b3d4510c78d3090aa8b87a831
--- /dev/null
+++ b/configs/convnext/convnext-base_32xb128_in1k-384px.py
@@ -0,0 +1,23 @@
+_base_ = [
+ '../_base_/models/convnext/convnext-base.py',
+ '../_base_/datasets/imagenet_bs64_swin_384.py',
+ '../_base_/schedules/imagenet_bs1024_adamw_swin.py',
+ '../_base_/default_runtime.py',
+]
+
+# dataset setting
+train_dataloader = dict(batch_size=128)
+
+# schedule setting
+optim_wrapper = dict(
+ optimizer=dict(lr=4e-3),
+ clip_grad=dict(max_norm=5.0),
+)
+
+# runtime setting
+custom_hooks = [dict(type='EMAHook', momentum=4e-5, priority='ABOVE_NORMAL')]
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR
+# based on the actual training batch size.
+# base_batch_size = (32 GPUs) x (128 samples per GPU)
+auto_scale_lr = dict(base_batch_size=4096)
diff --git a/configs/convnext/convnext-base_32xb128_in1k.py b/configs/convnext/convnext-base_32xb128_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..5ae8ec47c4c7ac3f22712c97dbad315c7a798e6f
--- /dev/null
+++ b/configs/convnext/convnext-base_32xb128_in1k.py
@@ -0,0 +1,23 @@
+_base_ = [
+ '../_base_/models/convnext/convnext-base.py',
+ '../_base_/datasets/imagenet_bs64_swin_224.py',
+ '../_base_/schedules/imagenet_bs1024_adamw_swin.py',
+ '../_base_/default_runtime.py',
+]
+
+# dataset setting
+train_dataloader = dict(batch_size=128)
+
+# schedule setting
+optim_wrapper = dict(
+ optimizer=dict(lr=4e-3),
+ clip_grad=None,
+)
+
+# runtime setting
+custom_hooks = [dict(type='EMAHook', momentum=1e-4, priority='ABOVE_NORMAL')]
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR
+# based on the actual training batch size.
+# base_batch_size = (32 GPUs) x (128 samples per GPU)
+auto_scale_lr = dict(base_batch_size=4096)
diff --git a/configs/convnext/convnext-base_32xb128_in21k.py b/configs/convnext/convnext-base_32xb128_in21k.py
new file mode 100644
index 0000000000000000000000000000000000000000..c343526c7f084501fc3651c1581752209f5019a4
--- /dev/null
+++ b/configs/convnext/convnext-base_32xb128_in21k.py
@@ -0,0 +1,24 @@
+_base_ = [
+ '../_base_/models/convnext/convnext-base.py',
+ '../_base_/datasets/imagenet21k_bs128.py',
+ '../_base_/schedules/imagenet_bs1024_adamw_swin.py',
+ '../_base_/default_runtime.py',
+]
+
+# model setting
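+# ImageNet-21k has 21841 classes.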
+model = dict(head=dict(num_classes=21841))
+
+# dataset setting
+data_preprocessor = dict(num_classes=21841)
+train_dataloader = dict(batch_size=128)
+
+# schedule setting
+optim_wrapper = dict(
+ optimizer=dict(lr=4e-3),
+ clip_grad=dict(max_norm=5.0),
+)
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR
+# based on the actual training batch size.
+# base_batch_size = (32 GPUs) x (128 samples per GPU)
+auto_scale_lr = dict(base_batch_size=4096)
diff --git a/configs/convnext/convnext-large_64xb64_in1k-384px.py b/configs/convnext/convnext-large_64xb64_in1k-384px.py
new file mode 100644
index 0000000000000000000000000000000000000000..6698b9edcdae463d6d1cf943237efbaf236cd71c
--- /dev/null
+++ b/configs/convnext/convnext-large_64xb64_in1k-384px.py
@@ -0,0 +1,23 @@
+_base_ = [
+ '../_base_/models/convnext/convnext-large.py',
+ '../_base_/datasets/imagenet_bs64_swin_384.py',
+ '../_base_/schedules/imagenet_bs1024_adamw_swin.py',
+ '../_base_/default_runtime.py',
+]
+
+# dataset setting
+train_dataloader = dict(batch_size=64)
+
+# schedule setting
+optim_wrapper = dict(
+ optimizer=dict(lr=4e-3),
+ clip_grad=dict(max_norm=5.0),
+)
+
+# runtime setting
+custom_hooks = [dict(type='EMAHook', momentum=4e-5, priority='ABOVE_NORMAL')]
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR
+# based on the actual training batch size.
+# base_batch_size = (64 GPUs) x (64 samples per GPU)
+auto_scale_lr = dict(base_batch_size=4096)
diff --git a/configs/convnext/convnext-large_64xb64_in1k.py b/configs/convnext/convnext-large_64xb64_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..8a78c58bc3d85e0e08083d339378886f870388bc
--- /dev/null
+++ b/configs/convnext/convnext-large_64xb64_in1k.py
@@ -0,0 +1,23 @@
+_base_ = [
+ '../_base_/models/convnext/convnext-large.py',
+ '../_base_/datasets/imagenet_bs64_swin_224.py',
+ '../_base_/schedules/imagenet_bs1024_adamw_swin.py',
+ '../_base_/default_runtime.py',
+]
+
+# dataset setting
+train_dataloader = dict(batch_size=64)
+
+# schedule setting
+optim_wrapper = dict(
+ optimizer=dict(lr=4e-3),
+ clip_grad=None,
+)
+
+# runtime setting
+custom_hooks = [dict(type='EMAHook', momentum=1e-4, priority='ABOVE_NORMAL')]
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR
+# based on the actual training batch size.
+# base_batch_size = (64 GPUs) x (64 samples per GPU)
+auto_scale_lr = dict(base_batch_size=4096)
diff --git a/configs/convnext/convnext-large_64xb64_in21k.py b/configs/convnext/convnext-large_64xb64_in21k.py
new file mode 100644
index 0000000000000000000000000000000000000000..420edab67b1dc094f08b4a3810af522b2a988b62
--- /dev/null
+++ b/configs/convnext/convnext-large_64xb64_in21k.py
@@ -0,0 +1,24 @@
+_base_ = [
+ '../_base_/models/convnext/convnext-large.py',
+ '../_base_/datasets/imagenet21k_bs128.py',
+ '../_base_/schedules/imagenet_bs1024_adamw_swin.py',
+ '../_base_/default_runtime.py',
+]
+
+# model setting
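+# ImageNet-21k has 21841 classes.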
+model = dict(head=dict(num_classes=21841))
+
+# dataset setting
+data_preprocessor = dict(num_classes=21841)
+train_dataloader = dict(batch_size=64)
+
+# schedule setting
+optim_wrapper = dict(
+ optimizer=dict(lr=4e-3),
+ clip_grad=dict(max_norm=5.0),
+)
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR
+# based on the actual training batch size.
+# base_batch_size = (64 GPUs) x (64 samples per GPU)
+auto_scale_lr = dict(base_batch_size=4096)
diff --git a/configs/convnext/convnext-small_32xb128_in1k-384px.py b/configs/convnext/convnext-small_32xb128_in1k-384px.py
new file mode 100644
index 0000000000000000000000000000000000000000..729f00ad2fdf53943ffae9de165e2e9985e733c7
--- /dev/null
+++ b/configs/convnext/convnext-small_32xb128_in1k-384px.py
@@ -0,0 +1,23 @@
+_base_ = [
+ '../_base_/models/convnext/convnext-small.py',
+ '../_base_/datasets/imagenet_bs64_swin_384.py',
+ '../_base_/schedules/imagenet_bs1024_adamw_swin.py',
+ '../_base_/default_runtime.py',
+]
+
+# dataset setting
+train_dataloader = dict(batch_size=128)
+
+# schedule setting
+optim_wrapper = dict(
+ optimizer=dict(lr=4e-3),
+ clip_grad=dict(max_norm=5.0),
+)
+
+# runtime setting
+custom_hooks = [dict(type='EMAHook', momentum=4e-5, priority='ABOVE_NORMAL')]
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR
+# based on the actual training batch size.
+# base_batch_size = (32 GPUs) x (128 samples per GPU)
+auto_scale_lr = dict(base_batch_size=4096)
diff --git a/configs/convnext/convnext-small_32xb128_in1k.py b/configs/convnext/convnext-small_32xb128_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..b623e900f830fbea7891b61c737398c0dee1076e
--- /dev/null
+++ b/configs/convnext/convnext-small_32xb128_in1k.py
@@ -0,0 +1,23 @@
+_base_ = [
+ '../_base_/models/convnext/convnext-small.py',
+ '../_base_/datasets/imagenet_bs64_swin_224.py',
+ '../_base_/schedules/imagenet_bs1024_adamw_swin.py',
+ '../_base_/default_runtime.py',
+]
+
+# dataset setting
+train_dataloader = dict(batch_size=128)
+
+# schedule setting
+optim_wrapper = dict(
+ optimizer=dict(lr=4e-3),
+ clip_grad=None,
+)
+
+# runtime setting
+custom_hooks = [dict(type='EMAHook', momentum=1e-4, priority='ABOVE_NORMAL')]
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR
+# based on the actual training batch size.
+# base_batch_size = (32 GPUs) x (128 samples per GPU)
+auto_scale_lr = dict(base_batch_size=4096)
diff --git a/configs/convnext/convnext-tiny_32xb128_in1k-384px.py b/configs/convnext/convnext-tiny_32xb128_in1k-384px.py
new file mode 100644
index 0000000000000000000000000000000000000000..6513ad8dfa41714ecb5c9de5992496716337c596
--- /dev/null
+++ b/configs/convnext/convnext-tiny_32xb128_in1k-384px.py
@@ -0,0 +1,23 @@
+_base_ = [
+ '../_base_/models/convnext/convnext-tiny.py',
+ '../_base_/datasets/imagenet_bs64_swin_384.py',
+ '../_base_/schedules/imagenet_bs1024_adamw_swin.py',
+ '../_base_/default_runtime.py',
+]
+
+# dataset setting
+train_dataloader = dict(batch_size=128)
+
+# schedule setting
+optim_wrapper = dict(
+ optimizer=dict(lr=4e-3),
+ clip_grad=dict(max_norm=5.0),
+)
+
+# runtime setting
+custom_hooks = [dict(type='EMAHook', momentum=4e-5, priority='ABOVE_NORMAL')]
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR
+# based on the actual training batch size.
+# base_batch_size = (32 GPUs) x (128 samples per GPU)
+auto_scale_lr = dict(base_batch_size=4096)
diff --git a/configs/convnext/convnext-tiny_32xb128_in1k.py b/configs/convnext/convnext-tiny_32xb128_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..59d3004bde89510b5c44110c8a6513957c0cbba0
--- /dev/null
+++ b/configs/convnext/convnext-tiny_32xb128_in1k.py
@@ -0,0 +1,23 @@
+_base_ = [
+ '../_base_/models/convnext/convnext-tiny.py',
+ '../_base_/datasets/imagenet_bs64_swin_224.py',
+ '../_base_/schedules/imagenet_bs1024_adamw_swin.py',
+ '../_base_/default_runtime.py',
+]
+
+# dataset setting
+train_dataloader = dict(batch_size=128)
+
+# schedule setting
+optim_wrapper = dict(
+ optimizer=dict(lr=4e-3),
+ clip_grad=None,
+)
+
+# runtime setting
+custom_hooks = [dict(type='EMAHook', momentum=1e-4, priority='ABOVE_NORMAL')]
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR
+# based on the actual training batch size.
+# base_batch_size = (32 GPUs) x (128 samples per GPU)
+auto_scale_lr = dict(base_batch_size=4096)
diff --git a/configs/convnext/convnext-xlarge_64xb64_in1k-384px.py b/configs/convnext/convnext-xlarge_64xb64_in1k-384px.py
new file mode 100644
index 0000000000000000000000000000000000000000..6edc94d2448157fc82bf38a988bf4393f192a89f
--- /dev/null
+++ b/configs/convnext/convnext-xlarge_64xb64_in1k-384px.py
@@ -0,0 +1,23 @@
+_base_ = [
+ '../_base_/models/convnext/convnext-xlarge.py',
+ '../_base_/datasets/imagenet_bs64_swin_384.py',
+ '../_base_/schedules/imagenet_bs1024_adamw_swin.py',
+ '../_base_/default_runtime.py',
+]
+
+# dataset setting
+train_dataloader = dict(batch_size=64)
+
+# schedule setting
+optim_wrapper = dict(
+ optimizer=dict(lr=4e-3),
+ clip_grad=dict(max_norm=5.0),
+)
+
+# runtime setting
+custom_hooks = [dict(type='EMAHook', momentum=4e-5, priority='ABOVE_NORMAL')]
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR
+# based on the actual training batch size.
+# base_batch_size = (64 GPUs) x (64 samples per GPU)
+auto_scale_lr = dict(base_batch_size=4096)
diff --git a/configs/convnext/convnext-xlarge_64xb64_in1k.py b/configs/convnext/convnext-xlarge_64xb64_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..528894e808b7085ee66d8be89cf84f860ddec979
--- /dev/null
+++ b/configs/convnext/convnext-xlarge_64xb64_in1k.py
@@ -0,0 +1,23 @@
+_base_ = [
+ '../_base_/models/convnext/convnext-xlarge.py',
+ '../_base_/datasets/imagenet_bs64_swin_224.py',
+ '../_base_/schedules/imagenet_bs1024_adamw_swin.py',
+ '../_base_/default_runtime.py',
+]
+
+# dataset setting
+train_dataloader = dict(batch_size=64)
+
+# schedule setting
+optim_wrapper = dict(
+ optimizer=dict(lr=4e-3),
+ clip_grad=None,
+)
+
+# runtime setting
+custom_hooks = [dict(type='EMAHook', momentum=1e-4, priority='ABOVE_NORMAL')]
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR
+# based on the actual training batch size.
+# base_batch_size = (64 GPUs) x (64 samples per GPU)
+auto_scale_lr = dict(base_batch_size=4096)
diff --git a/configs/convnext/convnext-xlarge_64xb64_in21k.py b/configs/convnext/convnext-xlarge_64xb64_in21k.py
new file mode 100644
index 0000000000000000000000000000000000000000..420edab67b1dc094f08b4a3810af522b2a988b62
--- /dev/null
+++ b/configs/convnext/convnext-xlarge_64xb64_in21k.py
@@ -0,0 +1,24 @@
+_base_ = [
+ '../_base_/models/convnext/convnext-xlarge.py',
+ '../_base_/datasets/imagenet21k_bs128.py',
+ '../_base_/schedules/imagenet_bs1024_adamw_swin.py',
+ '../_base_/default_runtime.py',
+]
+
+# model setting
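+# ImageNet-21k has 21841 classes.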
+model = dict(head=dict(num_classes=21841))
+
+# dataset setting
+data_preprocessor = dict(num_classes=21841)
+train_dataloader = dict(batch_size=64)
+
+# schedule setting
+optim_wrapper = dict(
+ optimizer=dict(lr=4e-3),
+ clip_grad=dict(max_norm=5.0),
+)
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR
+# based on the actual training batch size.
+# base_batch_size = (64 GPUs) x (64 samples per GPU)
+auto_scale_lr = dict(base_batch_size=4096)
diff --git a/configs/convnext/metafile.yml b/configs/convnext/metafile.yml
new file mode 100644
index 0000000000000000000000000000000000000000..16896629f07ffadd5313a6e38bc1532ddc3c08f2
--- /dev/null
+++ b/configs/convnext/metafile.yml
@@ -0,0 +1,410 @@
+Collections:
+ - Name: ConvNeXt
+ Metadata:
+ Training Data: ImageNet-1k
+ Architecture:
+ - 1x1 Convolution
+ - LayerScale
+ Paper:
+ URL: https://arxiv.org/abs/2201.03545v1
+ Title: A ConvNet for the 2020s
+ README: configs/convnext/README.md
+ Code:
+ Version: v0.20.1
+ URL: https://github.com/open-mmlab/mmpretrain/blob/v0.20.1/mmcls/models/backbones/convnext.py
+
+Models:
+ - Name: convnext-tiny_32xb128_in1k
+ Metadata:
+ FLOPs: 4457472768
+ Parameters: 28589128
+ In Collection: ConvNeXt
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 82.14
+ Top 5 Accuracy: 96.06
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/convnext/convnext-tiny_32xb128_in1k_20221207-998cf3e9.pth
+ Config: configs/convnext/convnext-tiny_32xb128_in1k.py
+ - Name: convnext-tiny_32xb128-noema_in1k
+ Metadata:
+ Training Data: ImageNet-1k
+ FLOPs: 4457472768
+ Parameters: 28589128
+ In Collection: ConvNeXt
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 81.95
+ Top 5 Accuracy: 95.89
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/convnext/convnext-tiny_32xb128-noema_in1k_20221208-5d4509c7.pth
+ Config: configs/convnext/convnext-tiny_32xb128_in1k.py
+ - Name: convnext-tiny_in21k-pre_3rdparty_in1k
+ Metadata:
+ Training Data:
+ - ImageNet-21k
+ - ImageNet-1k
+ FLOPs: 4457472768
+ Parameters: 28589128
+ In Collection: ConvNeXt
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 82.90
+ Top 5 Accuracy: 96.62
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/convnext/convnext-tiny_in21k-pre_3rdparty_in1k_20221219-7501e534.pth
+ Config: configs/convnext/convnext-tiny_32xb128_in1k.py
+ Converted From:
+ Weights: https://dl.fbaipublicfiles.com/convnext/convnext_tiny_22k_1k_224.pth
+ Code: https://github.com/facebookresearch/ConvNeXt
+ - Name: convnext-tiny_in21k-pre_3rdparty_in1k-384px
+ Metadata:
+ Training Data:
+ - ImageNet-21k
+ - ImageNet-1k
+ FLOPs: 13135236864
+ Parameters: 28589128
+ In Collection: ConvNeXt
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 84.11
+ Top 5 Accuracy: 97.14
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/convnext/convnext-tiny_in21k-pre_3rdparty_in1k-384px_20221219-c1182362.pth
+ Config: configs/convnext/convnext-tiny_32xb128_in1k-384px.py
+ Converted From:
+ Weights: https://dl.fbaipublicfiles.com/convnext/convnext_tiny_22k_1k_384.pth
+ Code: https://github.com/facebookresearch/ConvNeXt
+ - Name: convnext-small_32xb128_in1k
+ Metadata:
+ Training Data: ImageNet-1k
+ FLOPs: 8687008512
+ Parameters: 50223688
+ In Collection: ConvNeXt
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 83.16
+ Top 5 Accuracy: 96.56
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/convnext/convnext-small_32xb128_in1k_20221207-4ab7052c.pth
+ Config: configs/convnext/convnext-small_32xb128_in1k.py
+ - Name: convnext-small_32xb128-noema_in1k
+ Metadata:
+ Training Data: ImageNet-1k
+ FLOPs: 8687008512
+ Parameters: 50223688
+ In Collection: ConvNeXt
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 83.21
+ Top 5 Accuracy: 96.48
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/convnext/convnext-small_32xb128-noema_in1k_20221208-4a618995.pth
+ Config: configs/convnext/convnext-small_32xb128_in1k.py
+ - Name: convnext-small_in21k-pre_3rdparty_in1k
+ Metadata:
+ Training Data:
+ - ImageNet-21k
+ - ImageNet-1k
+ FLOPs: 8687008512
+ Parameters: 50223688
+ In Collection: ConvNeXt
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 84.59
+ Top 5 Accuracy: 97.41
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/convnext/convnext-small_in21k-pre_3rdparty_in1k_20221219-aeca4c93.pth
+ Config: configs/convnext/convnext-small_32xb128_in1k.py
+ Converted From:
+ Weights: https://dl.fbaipublicfiles.com/convnext/convnext_small_22k_1k_224.pth
+ Code: https://github.com/facebookresearch/ConvNeXt
+ - Name: convnext-small_in21k-pre_3rdparty_in1k-384px
+ Metadata:
+ Training Data:
+ - ImageNet-21k
+ - ImageNet-1k
+ FLOPs: 25580818176
+ Parameters: 50223688
+ In Collection: ConvNeXt
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 85.75
+ Top 5 Accuracy: 97.88
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/convnext/convnext-small_in21k-pre_3rdparty_in1k-384px_20221219-96f0bb87.pth
+ Config: configs/convnext/convnext-small_32xb128_in1k-384px.py
+ Converted From:
+ Weights: https://dl.fbaipublicfiles.com/convnext/convnext_small_22k_1k_384.pth
+ Code: https://github.com/facebookresearch/ConvNeXt
+ - Name: convnext-base_32xb128_in1k
+ Metadata:
+ Training Data: ImageNet-1k
+ FLOPs: 15359124480
+ Parameters: 88591464
+ In Collection: ConvNeXt
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 83.66
+ Top 5 Accuracy: 96.74
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/convnext/convnext-base_32xb128_in1k_20221207-fbdb5eb9.pth
+ Config: configs/convnext/convnext-base_32xb128_in1k.py
+ - Name: convnext-base_32xb128-noema_in1k
+ Metadata:
+ Training Data: ImageNet-1k
+ FLOPs: 15359124480
+ Parameters: 88591464
+ In Collection: ConvNeXt
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 83.64
+ Top 5 Accuracy: 96.61
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/convnext/convnext-base_32xb128-noema_in1k_20221208-f8182678.pth
+ Config: configs/convnext/convnext-base_32xb128_in1k.py
+ - Name: convnext-base_3rdparty_in1k
+ Metadata:
+ Training Data: ImageNet-1k
+ FLOPs: 15359124480
+ Parameters: 88591464
+ In Collection: ConvNeXt
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 83.85
+ Top 5 Accuracy: 96.74
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/convnext/convnext-base_3rdparty_32xb128_in1k_20220124-d0915162.pth
+ Config: configs/convnext/convnext-base_32xb128_in1k.py
+ Converted From:
+ Weights: https://dl.fbaipublicfiles.com/convnext/convnext_base_1k_224_ema.pth
+ Code: https://github.com/facebookresearch/ConvNeXt
+ - Name: convnext-base_3rdparty-noema_in1k
+ Metadata:
+ Training Data: ImageNet-1k
+ FLOPs: 15359124480
+ Parameters: 88591464
+ In Collection: ConvNeXt
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 83.71
+ Top 5 Accuracy: 96.60
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/convnext/convnext-base_3rdparty_32xb128-noema_in1k_20220222-dba4f95f.pth
+ Config: configs/convnext/convnext-base_32xb128_in1k.py
+ Converted From:
+ Weights: https://dl.fbaipublicfiles.com/convnext/convnext_base_1k_224.pth
+ Code: https://github.com/facebookresearch/ConvNeXt
+ - Name: convnext-base_3rdparty_in1k-384px
+ Metadata:
+ Training Data: ImageNet-1k
+ FLOPs: 45205885952
+ Parameters: 88591464
+ In Collection: ConvNeXt
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 85.10
+ Top 5 Accuracy: 97.34
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/convnext/convnext-base_3rdparty_in1k-384px_20221219-c8f1dc2b.pth
+ Config: configs/convnext/convnext-base_32xb128_in1k-384px.py
+ Converted From:
+ Weights: https://dl.fbaipublicfiles.com/convnext/convnext_base_1k_384.pth
+ Code: https://github.com/facebookresearch/ConvNeXt
+ - Name: convnext-base_3rdparty_in21k
+ Metadata:
+ Training Data: ImageNet-21k
+ FLOPs: 15359124480
+ Parameters: 88591464
+ In Collection: ConvNeXt
+ Results: null
+ Weights: https://download.openmmlab.com/mmclassification/v0/convnext/convnext-base_3rdparty_in21k_20220124-13b83eec.pth
+ Config: configs/convnext/convnext-base_32xb128_in21k.py
+ Converted From:
+ Weights: https://dl.fbaipublicfiles.com/convnext/convnext_base_22k_224.pth
+ Code: https://github.com/facebookresearch/ConvNeXt
+ - Name: convnext-base_in21k-pre_3rdparty_in1k
+ Metadata:
+ Training Data:
+ - ImageNet-21k
+ - ImageNet-1k
+ FLOPs: 15359124480
+ Parameters: 88591464
+ In Collection: ConvNeXt
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 85.81
+ Top 5 Accuracy: 97.86
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/convnext/convnext-base_in21k-pre-3rdparty_32xb128_in1k_20220124-eb2d6ada.pth
+ Config: configs/convnext/convnext-base_32xb128_in1k.py
+ Converted From:
+ Weights: https://dl.fbaipublicfiles.com/convnext/convnext_base_22k_1k_224.pth
+ Code: https://github.com/facebookresearch/ConvNeXt
+ - Name: convnext-base_in21k-pre-3rdparty_in1k-384px
+ Metadata:
+ Training Data:
+ - ImageNet-21k
+ - ImageNet-1k
+ FLOPs: 45205885952
+ Parameters: 88591464
+ In Collection: ConvNeXt
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 86.82
+ Top 5 Accuracy: 98.25
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/convnext/convnext-base_in21k-pre-3rdparty_in1k-384px_20221219-4570f792.pth
+ Config: configs/convnext/convnext-base_32xb128_in1k-384px.py
+ Converted From:
+ Weights: https://dl.fbaipublicfiles.com/convnext/convnext_base_22k_1k_384.pth
+ Code: https://github.com/facebookresearch/ConvNeXt
+ - Name: convnext-large_3rdparty_in1k
+ Metadata:
+ Training Data: ImageNet-1k
+ FLOPs: 34368026112
+ Parameters: 197767336
+ In Collection: ConvNeXt
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 84.30
+ Top 5 Accuracy: 96.89
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/convnext/convnext-large_3rdparty_64xb64_in1k_20220124-f8a0ded0.pth
+ Config: configs/convnext/convnext-large_64xb64_in1k.py
+ Converted From:
+ Weights: https://dl.fbaipublicfiles.com/convnext/convnext_large_1k_224_ema.pth
+ Code: https://github.com/facebookresearch/ConvNeXt
+ - Name: convnext-large_3rdparty_in1k-384px
+ Metadata:
+ Training Data: ImageNet-1k
+ FLOPs: 101103214080
+ Parameters: 197767336
+ In Collection: ConvNeXt
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 85.50
+ Top 5 Accuracy: 97.59
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/convnext/convnext-large_3rdparty_in1k-384px_20221219-6dd29d10.pth
+ Config: configs/convnext/convnext-large_64xb64_in1k-384px.py
+ Converted From:
+ Weights: https://dl.fbaipublicfiles.com/convnext/convnext_large_1k_384.pth
+ Code: https://github.com/facebookresearch/ConvNeXt
+ - Name: convnext-large_3rdparty_in21k
+ Metadata:
+ Training Data: ImageNet-21k
+ FLOPs: 34368026112
+ Parameters: 197767336
+ In Collection: ConvNeXt
+ Results: null
+ Weights: https://download.openmmlab.com/mmclassification/v0/convnext/convnext-large_3rdparty_in21k_20220124-41b5a79f.pth
+ Config: configs/convnext/convnext-large_64xb64_in21k.py
+ Converted From:
+ Weights: https://dl.fbaipublicfiles.com/convnext/convnext_large_22k_224.pth
+ Code: https://github.com/facebookresearch/ConvNeXt
+ - Name: convnext-large_in21k-pre_3rdparty_in1k
+ Metadata:
+ Training Data:
+ - ImageNet-21k
+ - ImageNet-1k
+ FLOPs: 34368026112
+ Parameters: 197767336
+ In Collection: ConvNeXt
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 86.61
+ Top 5 Accuracy: 98.04
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/convnext/convnext-large_in21k-pre-3rdparty_64xb64_in1k_20220124-2412403d.pth
+ Config: configs/convnext/convnext-large_64xb64_in1k.py
+ Converted From:
+ Weights: https://dl.fbaipublicfiles.com/convnext/convnext_large_22k_1k_224.pth
+ Code: https://github.com/facebookresearch/ConvNeXt
+ - Name: convnext-large_in21k-pre-3rdparty_in1k-384px
+ Metadata:
+ Training Data:
+ - ImageNet-21k
+ - ImageNet-1k
+ FLOPs: 101103214080
+ Parameters: 197767336
+ In Collection: ConvNeXt
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 87.46
+ Top 5 Accuracy: 98.37
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/convnext/convnext-large_in21k-pre-3rdparty_in1k-384px_20221219-6d38dd66.pth
+ Config: configs/convnext/convnext-large_64xb64_in1k-384px.py
+ Converted From:
+ Weights: https://dl.fbaipublicfiles.com/convnext/convnext_large_22k_1k_384.pth
+ Code: https://github.com/facebookresearch/ConvNeXt
+ - Name: convnext-xlarge_3rdparty_in21k
+ Metadata:
+ Training Data: ImageNet-21k
+ FLOPs: 60929820672
+ Parameters: 350196968
+ In Collection: ConvNeXt
+ Results: null
+ Weights: https://download.openmmlab.com/mmclassification/v0/convnext/convnext-xlarge_3rdparty_in21k_20220124-f909bad7.pth
+ Config: configs/convnext/convnext-xlarge_64xb64_in21k.py
+ Converted From:
+ Weights: https://dl.fbaipublicfiles.com/convnext/convnext_xlarge_22k_224.pth
+ Code: https://github.com/facebookresearch/ConvNeXt
+ - Name: convnext-xlarge_in21k-pre_3rdparty_in1k
+ Metadata:
+ Training Data:
+ - ImageNet-21k
+ - ImageNet-1k
+ FLOPs: 60929820672
+ Parameters: 350196968
+ In Collection: ConvNeXt
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 86.97
+ Top 5 Accuracy: 98.20
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/convnext/convnext-xlarge_in21k-pre-3rdparty_64xb64_in1k_20220124-76b6863d.pth
+ Config: configs/convnext/convnext-xlarge_64xb64_in1k.py
+ Converted From:
+ Weights: https://dl.fbaipublicfiles.com/convnext/convnext_xlarge_22k_1k_224_ema.pth
+ Code: https://github.com/facebookresearch/ConvNeXt
+ - Name: convnext-xlarge_in21k-pre-3rdparty_in1k-384px
+ Metadata:
+ Training Data:
+ - ImageNet-21k
+ - ImageNet-1k
+ FLOPs: 179196798976
+ Parameters: 350196968
+ In Collection: ConvNeXt
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 87.76
+ Top 5 Accuracy: 98.55
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/convnext/convnext-xlarge_in21k-pre-3rdparty_in1k-384px_20221219-b161bc14.pth
+ Config: configs/convnext/convnext-xlarge_64xb64_in1k-384px.py
+ Converted From:
+ Weights: https://dl.fbaipublicfiles.com/convnext/convnext_xlarge_22k_1k_384_ema.pth
+ Code: https://github.com/facebookresearch/ConvNeXt
diff --git a/configs/convnext_v2/README.md b/configs/convnext_v2/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..e561387412aa3a8e088cb7d015e7b98dba8e50c1
--- /dev/null
+++ b/configs/convnext_v2/README.md
@@ -0,0 +1,107 @@
+# ConvNeXt V2
+
+> [Co-designing and Scaling ConvNets with Masked Autoencoders](http://arxiv.org/abs/2301.00808)
+
+
+
+## Abstract
+
+Driven by improved architectures and better representation learning frameworks, the field of visual recognition has enjoyed rapid modernization and performance boost in the early 2020s. For example, modern ConvNets, represented by ConvNeXt, have demonstrated strong performance in various scenarios. While these models were originally designed for supervised learning with ImageNet labels, they can also potentially benefit from self-supervised learning techniques such as masked autoencoders (MAE). However, we found that simply combining these two approaches leads to subpar performance. In this paper, we propose a fully convolutional masked autoencoder framework and a new Global Response Normalization (GRN) layer that can be added to the ConvNeXt architecture to enhance inter-channel feature competition. This co-design of self-supervised learning techniques and architectural improvement results in a new model family called ConvNeXt V2, which significantly improves the performance of pure ConvNets on various recognition benchmarks, including ImageNet classification, COCO detection, and ADE20K segmentation. We also provide pre-trained ConvNeXt V2 models of various sizes, ranging from an efficient 3.7M-parameter Atto model with 76.7% top-1 accuracy on ImageNet, to a 650M Huge model that achieves a state-of-the-art 88.9% accuracy using only public training data.
+
+
+

+
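+The key architectural addition described above, Global Response Normalization (GRN), is compact enough to sketch directly. The snippet below is a minimal PyTorch rendition following the formulation in the paper (per-channel global L2 aggregation over the spatial dimensions, divisive normalization across channels, then a learnable calibration with a residual connection); the backbone implementation shipped in this repository may differ in details such as epsilon handling and parameter shapes.
+
+```python
+import torch
+import torch.nn as nn
+
+
+class GRN(nn.Module):
+    """Global Response Normalization over channels-last (N, H, W, C) features."""
+
+    def __init__(self, channels: int, eps: float = 1e-6):
+        super().__init__()
+        self.gamma = nn.Parameter(torch.zeros(1, 1, 1, channels))
+        self.beta = nn.Parameter(torch.zeros(1, 1, 1, channels))
+        self.eps = eps
+
+    def forward(self, x: torch.Tensor) -> torch.Tensor:
+        # Global aggregation: per-channel L2 norm over the spatial dimensions.
+        gx = torch.linalg.vector_norm(x, ord=2, dim=(1, 2), keepdim=True)
+        # Divisive normalization across channels.
+        nx = gx / (gx.mean(dim=-1, keepdim=True) + self.eps)
+        # Learnable calibration plus a residual connection.
+        return self.gamma * (x * nx) + self.beta + x
+
+
+feat = torch.rand(2, 7, 7, 96)    # channels-last, as inside ConvNeXt blocks
+print(GRN(96)(feat).shape)        # torch.Size([2, 7, 7, 96])
+```
+
+In the paper, GRN is placed after the activation of the expansion MLP in each block, and LayerScale is removed as redundant.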
+
+## How to use it?
+
+
+
+**Predict image**
+
+```python
+from mmpretrain import inference_model
+
+predict = inference_model('convnext-v2-atto_fcmae-pre_3rdparty_in1k', 'demo/bird.JPEG')
+print(predict['pred_class'])
+print(predict['pred_score'])
+```
+
+**Use the model**
+
+```python
+import torch
+from mmpretrain import get_model
+
+model = get_model('convnext-v2-atto_3rdparty-fcmae_in1k', pretrained=True)
+inputs = torch.rand(1, 3, 224, 224)
+out = model(inputs)
+print(type(out))
+# To extract features.
+feats = model.extract_feat(inputs)
+print(type(feats))
+```
+
+**Test Command**
+
+Prepare your dataset according to the [docs](https://mmpretrain.readthedocs.io/en/latest/user_guides/dataset_prepare.html#prepare-dataset).
+
+Test:
+
+```shell
+python tools/test.py configs/convnext_v2/convnext-v2-atto_32xb32_in1k.py https://download.openmmlab.com/mmclassification/v0/convnext-v2/convnext-v2-atto_fcmae-pre_3rdparty_in1k_20230104-23765f83.pth
+```
+
+
+
+## Models and results
+
+### Pretrained models
+
+| Model | Params (M) | Flops (G) | Config | Download |
+| :---------------------------------------- | :--------: | :-------: | :----------------------------------------: | :------------------------------------------------------------------------------------------------: |
+| `convnext-v2-atto_3rdparty-fcmae_in1k`\* | 3.71 | 0.55 | [config](convnext-v2-atto_32xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/convnext-v2/convnext-v2-atto_3rdparty-fcmae_in1k_20230104-07514db4.pth) |
+| `convnext-v2-femto_3rdparty-fcmae_in1k`\* | 5.23 | 0.78 | [config](convnext-v2-femto_32xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/convnext-v2/convnext-v2-femto_3rdparty-fcmae_in1k_20230104-adbe2082.pth) |
+| `convnext-v2-pico_3rdparty-fcmae_in1k`\* | 9.07 | 1.37 | [config](convnext-v2-pico_32xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/convnext-v2/convnext-v2-pico_3rdparty-fcmae_in1k_20230104-147b1b59.pth) |
+| `convnext-v2-nano_3rdparty-fcmae_in1k`\* | 15.62 | 2.45 | [config](convnext-v2-nano_32xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/convnext-v2/convnext-v2-nano_3rdparty-fcmae_in1k_20230104-3dd1f29e.pth) |
+| `convnext-v2-tiny_3rdparty-fcmae_in1k`\* | 28.64 | 4.47 | [config](convnext-v2-tiny_32xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/convnext-v2/convnext-v2-tiny_3rdparty-fcmae_in1k_20230104-80513adc.pth) |
+| `convnext-v2-base_3rdparty-fcmae_in1k`\* | 88.72 | 15.38 | [config](convnext-v2-base_32xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/convnext-v2/convnext-v2-base_3rdparty-fcmae_in1k_20230104-8a798eaf.pth) |
+| `convnext-v2-large_3rdparty-fcmae_in1k`\* | 197.96 | 34.40 | [config](convnext-v2-large_32xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/convnext-v2/convnext-v2-large_3rdparty-fcmae_in1k_20230104-bf38df92.pth) |
+| `convnext-v2-huge_3rdparty-fcmae_in1k`\* | 660.29 | 115.00 | [config](convnext-v2-huge_32xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/convnext-v2/convnext-v2-huge_3rdparty-fcmae_in1k_20230104-fe43ae6c.pth) |
+
+*Models with * are converted from the [official repo](https://github.com/facebookresearch/ConvNeXt-V2). The config files of these models are only for inference. We haven't reproduced the training results.*
+
+### Image Classification on ImageNet-1k
+
+| Model | Pretrain | Params (M) | Flops (G) | Top-1 (%) | Top-5 (%) | Config | Download |
+| :---------------------------------------------- | :----------------: | :--------: | :-------: | :-------: | :-------: | :----------------------------------------------: | :------------------------------------------------: |
+| `convnext-v2-atto_fcmae-pre_3rdparty_in1k`\* | FCMAE | 3.71 | 0.55 | 76.64 | 93.04 | [config](convnext-v2-atto_32xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/convnext-v2/convnext-v2-atto_fcmae-pre_3rdparty_in1k_20230104-23765f83.pth) |
+| `convnext-v2-femto_fcmae-pre_3rdparty_in1k`\* | FCMAE | 5.23 | 0.78 | 78.48 | 93.98 | [config](convnext-v2-femto_32xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/convnext-v2/convnext-v2-femto_fcmae-pre_3rdparty_in1k_20230104-92a75d75.pth) |
+| `convnext-v2-pico_fcmae-pre_3rdparty_in1k`\* | FCMAE | 9.07 | 1.37 | 80.31 | 95.08 | [config](convnext-v2-pico_32xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/convnext-v2/convnext-v2-pico_fcmae-pre_3rdparty_in1k_20230104-d20263ca.pth) |
+| `convnext-v2-nano_fcmae-pre_3rdparty_in1k`\* | FCMAE | 15.62 | 2.45 | 81.86 | 95.75 | [config](convnext-v2-nano_32xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/convnext-v2/convnext-v2-nano_fcmae-pre_3rdparty_in1k_20230104-fe1aaaf2.pth) |
+| `convnext-v2-nano_fcmae-in21k-pre_3rdparty_in1k`\* | FCMAE ImageNet-21k | 15.62 | 2.45 | 82.04 | 96.16 | [config](convnext-v2-nano_32xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/convnext-v2/convnext-v2-nano_fcmae-in21k-pre_3rdparty_in1k_20230104-91fa8ae2.pth) |
+| `convnext-v2-tiny_fcmae-pre_3rdparty_in1k`\* | FCMAE | 28.64 | 4.47 | 82.94 | 96.29 | [config](convnext-v2-tiny_32xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/convnext-v2/convnext-v2-tiny_fcmae-pre_3rdparty_in1k_20230104-471a86de.pth) |
+| `convnext-v2-tiny_fcmae-in21k-pre_3rdparty_in1k`\* | FCMAE ImageNet-21k | 28.64 | 4.47 | 83.89 | 96.96 | [config](convnext-v2-tiny_32xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/convnext-v2/convnext-v2-tiny_fcmae-in21k-pre_3rdparty_in1k_20230104-8cc8b8f2.pth) |
+| `convnext-v2-nano_fcmae-in21k-pre_3rdparty_in1k-384px`\* | FCMAE ImageNet-21k | 15.62 | 7.21 | 83.36 | 96.75 | [config](convnext-v2-nano_32xb32_in1k-384px.py) | [model](https://download.openmmlab.com/mmclassification/v0/convnext-v2/convnext-v2-nano_fcmae-in21k-pre_3rdparty_in1k-384px_20230104-f951ae87.pth) |
+| `convnext-v2-tiny_fcmae-in21k-pre_3rdparty_in1k-384px`\* | FCMAE ImageNet-21k | 28.64 | 13.14 | 85.09 | 97.63 | [config](convnext-v2-tiny_32xb32_in1k-384px.py) | [model](https://download.openmmlab.com/mmclassification/v0/convnext-v2/convnext-v2-tiny_fcmae-in21k-pre_3rdparty_in1k-384px_20230104-d8579f84.pth) |
+| `convnext-v2-base_fcmae-pre_3rdparty_in1k`\* | FCMAE | 88.72 | 15.38 | 84.87 | 97.08 | [config](convnext-v2-base_32xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/convnext-v2/convnext-v2-base_fcmae-pre_3rdparty_in1k_20230104-00a70fa4.pth) |
+| `convnext-v2-base_fcmae-in21k-pre_3rdparty_in1k`\* | FCMAE ImageNet-21k | 88.72 | 15.38 | 86.74 | 98.02 | [config](convnext-v2-base_32xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/convnext-v2/convnext-v2-base_fcmae-in21k-pre_3rdparty_in1k_20230104-c48d16a5.pth) |
+| `convnext-v2-large_fcmae-pre_3rdparty_in1k`\* | FCMAE | 197.96 | 34.40 | 85.76 | 97.59 | [config](convnext-v2-large_32xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/convnext-v2/convnext-v2-large_fcmae-pre_3rdparty_in1k_20230104-ef393013.pth) |
+| `convnext-v2-large_fcmae-in21k-pre_3rdparty_in1k`\* | FCMAE ImageNet-21k | 197.96 | 34.40 | 87.26 | 98.24 | [config](convnext-v2-large_32xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/convnext-v2/convnext-v2-large_fcmae-in21k-pre_3rdparty_in1k_20230104-d9c4dc0c.pth) |
+| `convnext-v2-base_fcmae-in21k-pre_3rdparty_in1k-384px`\* | FCMAE ImageNet-21k | 88.72 | 45.21 | 87.63 | 98.42 | [config](convnext-v2-base_32xb32_in1k-384px.py) | [model](https://download.openmmlab.com/mmclassification/v0/convnext-v2/convnext-v2-base_fcmae-in21k-pre_3rdparty_in1k-384px_20230104-379425cc.pth) |
+| `convnext-v2-large_fcmae-in21k-pre_3rdparty_in1k-384px`\* | FCMAE ImageNet-21k | 197.96 | 101.10 | 88.18 | 98.52 | [config](convnext-v2-large_32xb32_in1k-384px.py) | [model](https://download.openmmlab.com/mmclassification/v0/convnext-v2/convnext-v2-large_fcmae-in21k-pre_3rdparty_in1k-384px_20230104-9139a1f3.pth) |
+| `convnext-v2-huge_fcmae-pre_3rdparty_in1k`\* | FCMAE | 660.29 | 115.00 | 86.25 | 97.75 | [config](convnext-v2-huge_32xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/convnext-v2/convnext-v2-huge_fcmae-pre_3rdparty_in1k_20230104-f795e5b8.pth) |
+| `convnext-v2-huge_fcmae-in21k-pre_3rdparty_in1k-384px`\* | FCMAE ImageNet-21k | 660.29 | 337.96 | 88.68 | 98.73 | [config](convnext-v2-huge_32xb32_in1k-384px.py) | [model](https://download.openmmlab.com/mmclassification/v0/convnext-v2/convnext-v2-huge_fcmae-in21k-pre_3rdparty_in1k-384px_20230104-02a4eb35.pth) |
+| `convnext-v2-huge_fcmae-in21k-pre_3rdparty_in1k-512px`\* | FCMAE ImageNet-21k | 660.29 | 600.81 | 88.86 | 98.74 | [config](convnext-v2-huge_32xb32_in1k-512px.py) | [model](https://download.openmmlab.com/mmclassification/v0/convnext-v2/convnext-v2-huge_fcmae-in21k-pre_3rdparty_in1k-512px_20230104-ce32e63c.pth) |
+
+*Models with * are converted from the [official repo](https://github.com/facebookresearch/ConvNeXt-V2). The config files of these models are only for inference. We haven't reproduced the training results.*
+
+## Citation
+
+```bibtex
+@article{Woo2023ConvNeXtV2,
+ title={ConvNeXt V2: Co-designing and Scaling ConvNets with Masked Autoencoders},
+ author={Sanghyun Woo and Shoubhik Debnath and Ronghang Hu and Xinlei Chen and Zhuang Liu and In So Kweon and Saining Xie},
+ year={2023},
+ journal={arXiv preprint arXiv:2301.00808},
+}
+```
diff --git a/configs/convnext_v2/convnext-v2-atto_32xb32_in1k.py b/configs/convnext_v2/convnext-v2-atto_32xb32_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..68f34c9634e3390bb3c600351ef37e9a94c6d575
--- /dev/null
+++ b/configs/convnext_v2/convnext-v2-atto_32xb32_in1k.py
@@ -0,0 +1,24 @@
+_base_ = [
+ '../_base_/models/convnext_v2/atto.py',
+ '../_base_/datasets/imagenet_bs64_swin_224.py',
+ '../_base_/schedules/imagenet_bs1024_adamw_swin.py',
+ '../_base_/default_runtime.py',
+]
+
+# dataset setting
+train_dataloader = dict(batch_size=32)
+
+# schedule setting
+optim_wrapper = dict(
+ optimizer=dict(lr=8e-4, weight_decay=0.3),
+ clip_grad=None,
+)
+
+# learning policy
+param_scheduler = [dict(type='CosineAnnealingLR', eta_min=1e-5, by_epoch=True)]
+
+# train, val, test setting
+train_cfg = dict(by_epoch=True, max_epochs=600, val_interval=1)
+
+# runtime setting
+custom_hooks = [dict(type='EMAHook', momentum=1e-4, priority='ABOVE_NORMAL')]
diff --git a/configs/convnext_v2/convnext-v2-base_32xb32_in1k-384px.py b/configs/convnext_v2/convnext-v2-base_32xb32_in1k-384px.py
new file mode 100644
index 0000000000000000000000000000000000000000..70b7f18e0c9dfa92791ff1a8a77553680de673e7
--- /dev/null
+++ b/configs/convnext_v2/convnext-v2-base_32xb32_in1k-384px.py
@@ -0,0 +1,35 @@
+_base_ = [
+ '../_base_/models/convnext_v2/base.py',
+ '../_base_/datasets/imagenet_bs64_swin_384.py',
+ '../_base_/schedules/imagenet_bs1024_adamw_swin.py',
+ '../_base_/default_runtime.py',
+]
+
+# dataset setting
+train_dataloader = dict(batch_size=32)
+
+# schedule setting
+optim_wrapper = dict(
+ optimizer=dict(lr=2.5e-3),
+ clip_grad=None,
+)
+
+# learning policy
+param_scheduler = [
+ # warm up learning rate scheduler
+ dict(
+ type='LinearLR',
+ start_factor=1e-3,
+ by_epoch=True,
+ end=20,
+ # update by iter
+ convert_to_iter_based=True),
+ # main learning rate scheduler
+ dict(type='CosineAnnealingLR', eta_min=1e-5, by_epoch=True, begin=20)
+]
+
+# train, val, test setting
+train_cfg = dict(by_epoch=True, max_epochs=100, val_interval=1)
+
+# runtime setting
+custom_hooks = [dict(type='EMAHook', momentum=1e-4, priority='ABOVE_NORMAL')]
diff --git a/configs/convnext_v2/convnext-v2-base_32xb32_in1k.py b/configs/convnext_v2/convnext-v2-base_32xb32_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..b66b375eb3a3872842b4fdf72285db36a76dc3b8
--- /dev/null
+++ b/configs/convnext_v2/convnext-v2-base_32xb32_in1k.py
@@ -0,0 +1,35 @@
+_base_ = [
+ '../_base_/models/convnext_v2/base.py',
+ '../_base_/datasets/imagenet_bs64_swin_224.py',
+ '../_base_/schedules/imagenet_bs1024_adamw_swin.py',
+ '../_base_/default_runtime.py',
+]
+
+# dataset setting
+train_dataloader = dict(batch_size=32)
+
+# schedule setting
+optim_wrapper = dict(
+ optimizer=dict(lr=2.5e-3),
+ clip_grad=None,
+)
+
+# learning policy
+param_scheduler = [
+ # warm up learning rate scheduler
+ dict(
+ type='LinearLR',
+ start_factor=1e-3,
+ by_epoch=True,
+ end=20,
+ # update by iter
+ convert_to_iter_based=True),
+ # main learning rate scheduler
+ dict(type='CosineAnnealingLR', eta_min=1e-5, by_epoch=True, begin=20)
+]
+
+# train, val, test setting
+train_cfg = dict(by_epoch=True, max_epochs=100, val_interval=1)
+
+# runtime setting
+custom_hooks = [dict(type='EMAHook', momentum=1e-4, priority='ABOVE_NORMAL')]
diff --git a/configs/convnext_v2/convnext-v2-femto_32xb32_in1k.py b/configs/convnext_v2/convnext-v2-femto_32xb32_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..053e19478fe75dac91b616fa314f4fbdd2667c61
--- /dev/null
+++ b/configs/convnext_v2/convnext-v2-femto_32xb32_in1k.py
@@ -0,0 +1,24 @@
+_base_ = [
+ '../_base_/models/convnext_v2/femto.py',
+ '../_base_/datasets/imagenet_bs64_swin_224.py',
+ '../_base_/schedules/imagenet_bs1024_adamw_swin.py',
+ '../_base_/default_runtime.py',
+]
+
+# dataset setting
+train_dataloader = dict(batch_size=32)
+
+# schedule setting
+optim_wrapper = dict(
+ optimizer=dict(lr=8e-4, weight_decay=0.3),
+ clip_grad=None,
+)
+
+# learning policy
+param_scheduler = [dict(type='CosineAnnealingLR', eta_min=1e-5, by_epoch=True)]
+
+# train, val, test setting
+train_cfg = dict(by_epoch=True, max_epochs=600, val_interval=1)
+
+# runtime setting
+custom_hooks = [dict(type='EMAHook', momentum=1e-4, priority='ABOVE_NORMAL')]
diff --git a/configs/convnext_v2/convnext-v2-huge_32xb32_in1k-384px.py b/configs/convnext_v2/convnext-v2-huge_32xb32_in1k-384px.py
new file mode 100644
index 0000000000000000000000000000000000000000..b734b271ef9a7ada6085c14465a43ee05841b348
--- /dev/null
+++ b/configs/convnext_v2/convnext-v2-huge_32xb32_in1k-384px.py
@@ -0,0 +1,35 @@
+_base_ = [
+ '../_base_/models/convnext_v2/huge.py',
+ '../_base_/datasets/imagenet_bs64_swin_384.py',
+ '../_base_/schedules/imagenet_bs1024_adamw_swin.py',
+ '../_base_/default_runtime.py',
+]
+
+# dataset setting
+train_dataloader = dict(batch_size=32)
+
+# schedule setting
+optim_wrapper = dict(
+ optimizer=dict(lr=2.5e-3),
+ clip_grad=None,
+)
+
+# learning policy
+param_scheduler = [
+ # warm up learning rate scheduler
+ dict(
+ type='LinearLR',
+ start_factor=1e-3,
+ by_epoch=True,
+ end=20,
+ # update by iter
+ convert_to_iter_based=True),
+ # main learning rate scheduler
+ dict(type='CosineAnnealingLR', eta_min=1e-5, by_epoch=True, begin=20)
+]
+
+# train, val, test setting
+train_cfg = dict(by_epoch=True, max_epochs=100, val_interval=1)
+
+# runtime setting
+custom_hooks = [dict(type='EMAHook', momentum=1e-4, priority='ABOVE_NORMAL')]
diff --git a/configs/convnext_v2/convnext-v2-huge_32xb32_in1k-512px.py b/configs/convnext_v2/convnext-v2-huge_32xb32_in1k-512px.py
new file mode 100644
index 0000000000000000000000000000000000000000..7c63b023be3cbcca94e0847ed88febfd1b099223
--- /dev/null
+++ b/configs/convnext_v2/convnext-v2-huge_32xb32_in1k-512px.py
@@ -0,0 +1,54 @@
+_base_ = [
+ '../_base_/models/convnext_v2/huge.py',
+ '../_base_/datasets/imagenet_bs64_swin_384.py',
+ '../_base_/schedules/imagenet_bs1024_adamw_swin.py',
+ '../_base_/default_runtime.py',
+]
+
+# dataset setting
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='RandomResizedCrop',
+ scale=512,
+ backend='pillow',
+ interpolation='bicubic'),
+ dict(type='RandomFlip', prob=0.5, direction='horizontal'),
+ dict(type='PackInputs'),
+]
+
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='Resize', scale=512, backend='pillow', interpolation='bicubic'),
+ dict(type='PackInputs'),
+]
+
+train_dataloader = dict(batch_size=32, dataset=dict(pipeline=train_pipeline))
+val_dataloader = dict(dataset=dict(pipeline=test_pipeline))
+test_dataloader = dict(dataset=dict(pipeline=test_pipeline))
+
+# schedule setting
+optim_wrapper = dict(
+ optimizer=dict(lr=2.5e-3),
+ clip_grad=None,
+)
+
+# learning policy
+param_scheduler = [
+ # warm up learning rate scheduler
+ dict(
+ type='LinearLR',
+ start_factor=1e-3,
+ by_epoch=True,
+ end=20,
+ # update by iter
+ convert_to_iter_based=True),
+ # main learning rate scheduler
+ dict(type='CosineAnnealingLR', eta_min=1e-5, by_epoch=True, begin=20)
+]
+
+# train, val, test setting
+train_cfg = dict(by_epoch=True, max_epochs=100, val_interval=1)
+
+# runtime setting
+custom_hooks = [dict(type='EMAHook', momentum=1e-4, priority='ABOVE_NORMAL')]
diff --git a/configs/convnext_v2/convnext-v2-huge_32xb32_in1k.py b/configs/convnext_v2/convnext-v2-huge_32xb32_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..18621f3aeb86c1a8ad620d71625c2952ca145320
--- /dev/null
+++ b/configs/convnext_v2/convnext-v2-huge_32xb32_in1k.py
@@ -0,0 +1,35 @@
+_base_ = [
+ '../_base_/models/convnext_v2/huge.py',
+ '../_base_/datasets/imagenet_bs64_swin_224.py',
+ '../_base_/schedules/imagenet_bs1024_adamw_swin.py',
+ '../_base_/default_runtime.py',
+]
+
+# dataset setting
+train_dataloader = dict(batch_size=32)
+
+# schedule setting
+optim_wrapper = dict(
+ optimizer=dict(lr=2.5e-3),
+ clip_grad=None,
+)
+
+# learning policy
+param_scheduler = [
+ # warm up learning rate scheduler
+ dict(
+ type='LinearLR',
+ start_factor=1e-3,
+ by_epoch=True,
+ end=20,
+ # update by iter
+ convert_to_iter_based=True),
+ # main learning rate scheduler
+ dict(type='CosineAnnealingLR', eta_min=1e-5, by_epoch=True, begin=20)
+]
+
+# train, val, test setting
+train_cfg = dict(by_epoch=True, max_epochs=100, val_interval=1)
+
+# runtime setting
+custom_hooks = [dict(type='EMAHook', momentum=1e-4, priority='ABOVE_NORMAL')]
diff --git a/configs/convnext_v2/convnext-v2-large_32xb32_in1k-384px.py b/configs/convnext_v2/convnext-v2-large_32xb32_in1k-384px.py
new file mode 100644
index 0000000000000000000000000000000000000000..b08b12eb0507b2582fe237b498c97f57452e29ec
--- /dev/null
+++ b/configs/convnext_v2/convnext-v2-large_32xb32_in1k-384px.py
@@ -0,0 +1,35 @@
+_base_ = [
+ '../_base_/models/convnext_v2/large.py',
+ '../_base_/datasets/imagenet_bs64_swin_384.py',
+ '../_base_/schedules/imagenet_bs1024_adamw_swin.py',
+ '../_base_/default_runtime.py',
+]
+
+# dataset setting
+train_dataloader = dict(batch_size=32)
+
+# schedule setting
+optim_wrapper = dict(
+ optimizer=dict(lr=2.5e-3),
+ clip_grad=None,
+)
+
+# learning policy
+param_scheduler = [
+ # warm up learning rate scheduler
+ dict(
+ type='LinearLR',
+ start_factor=1e-3,
+ by_epoch=True,
+ end=20,
+ # update by iter
+ convert_to_iter_based=True),
+ # main learning rate scheduler
+ dict(type='CosineAnnealingLR', eta_min=1e-5, by_epoch=True, begin=20)
+]
+
+# train, val, test setting
+train_cfg = dict(by_epoch=True, max_epochs=100, val_interval=1)
+
+# runtime setting
+custom_hooks = [dict(type='EMAHook', momentum=1e-4, priority='ABOVE_NORMAL')]
diff --git a/configs/convnext_v2/convnext-v2-large_32xb32_in1k.py b/configs/convnext_v2/convnext-v2-large_32xb32_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..e9695d08e9c63bae6f440a427c07ddb68b08403b
--- /dev/null
+++ b/configs/convnext_v2/convnext-v2-large_32xb32_in1k.py
@@ -0,0 +1,35 @@
+_base_ = [
+ '../_base_/models/convnext_v2/large.py',
+ '../_base_/datasets/imagenet_bs64_swin_224.py',
+ '../_base_/schedules/imagenet_bs1024_adamw_swin.py',
+ '../_base_/default_runtime.py',
+]
+
+# dataset setting
+train_dataloader = dict(batch_size=32)
+
+# schedule setting
+optim_wrapper = dict(
+ optimizer=dict(lr=2.5e-3),
+ clip_grad=None,
+)
+
+# learning policy
+param_scheduler = [
+ # warm up learning rate scheduler
+ dict(
+ type='LinearLR',
+ start_factor=1e-3,
+ by_epoch=True,
+ end=20,
+ # update by iter
+ convert_to_iter_based=True),
+ # main learning rate scheduler
+ dict(type='CosineAnnealingLR', eta_min=1e-5, by_epoch=True, begin=20)
+]
+
+# train, val, test setting
+train_cfg = dict(by_epoch=True, max_epochs=100, val_interval=1)
+
+# runtime setting
+custom_hooks = [dict(type='EMAHook', momentum=1e-4, priority='ABOVE_NORMAL')]
diff --git a/configs/convnext_v2/convnext-v2-nano_32xb32_in1k-384px.py b/configs/convnext_v2/convnext-v2-nano_32xb32_in1k-384px.py
new file mode 100644
index 0000000000000000000000000000000000000000..a9b36dc59229e0dba661211c3570771453f54113
--- /dev/null
+++ b/configs/convnext_v2/convnext-v2-nano_32xb32_in1k-384px.py
@@ -0,0 +1,24 @@
+_base_ = [
+ '../_base_/models/convnext_v2/nano.py',
+ '../_base_/datasets/imagenet_bs64_swin_384.py',
+ '../_base_/schedules/imagenet_bs1024_adamw_swin.py',
+ '../_base_/default_runtime.py',
+]
+
+# dataset setting
+train_dataloader = dict(batch_size=32)
+
+# schedule setting
+optim_wrapper = dict(
+ optimizer=dict(lr=8e-4, weight_decay=0.3),
+ clip_grad=None,
+)
+
+# learning policy
+param_scheduler = [dict(type='CosineAnnealingLR', eta_min=1e-5, by_epoch=True)]
+
+# train, val, test setting
+train_cfg = dict(by_epoch=True, max_epochs=600, val_interval=1)
+
+# runtime setting
+custom_hooks = [dict(type='EMAHook', momentum=1e-4, priority='ABOVE_NORMAL')]
diff --git a/configs/convnext_v2/convnext-v2-nano_32xb32_in1k.py b/configs/convnext_v2/convnext-v2-nano_32xb32_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..9a7c9e3e629522b42b9ff4d02a479b4688a74b92
--- /dev/null
+++ b/configs/convnext_v2/convnext-v2-nano_32xb32_in1k.py
@@ -0,0 +1,24 @@
+_base_ = [
+ '../_base_/models/convnext_v2/nano.py',
+ '../_base_/datasets/imagenet_bs64_swin_224.py',
+ '../_base_/schedules/imagenet_bs1024_adamw_swin.py',
+ '../_base_/default_runtime.py',
+]
+
+# dataset setting
+train_dataloader = dict(batch_size=32)
+
+# schedule setting
+optim_wrapper = dict(
+ optimizer=dict(lr=8e-4, weight_decay=0.3),
+ clip_grad=None,
+)
+
+# learning policy
+param_scheduler = [dict(type='CosineAnnealingLR', eta_min=1e-5, by_epoch=True)]
+
+# train, val, test setting
+train_cfg = dict(by_epoch=True, max_epochs=600, val_interval=1)
+
+# runtime setting
+custom_hooks = [dict(type='EMAHook', momentum=1e-4, priority='ABOVE_NORMAL')]
diff --git a/configs/convnext_v2/convnext-v2-pico_32xb32_in1k.py b/configs/convnext_v2/convnext-v2-pico_32xb32_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..e2cc52ff252972724d4d6737dda1e784abc4d536
--- /dev/null
+++ b/configs/convnext_v2/convnext-v2-pico_32xb32_in1k.py
@@ -0,0 +1,24 @@
+_base_ = [
+ '../_base_/models/convnext_v2/pico.py',
+ '../_base_/datasets/imagenet_bs64_swin_224.py',
+ '../_base_/schedules/imagenet_bs1024_adamw_swin.py',
+ '../_base_/default_runtime.py',
+]
+
+# dataset setting
+train_dataloader = dict(batch_size=32)
+
+# schedule setting
+optim_wrapper = dict(
+ optimizer=dict(lr=8e-4, weight_decay=0.3),
+ clip_grad=None,
+)
+
+# learning policy
+param_scheduler = [dict(type='CosineAnnealingLR', eta_min=1e-5, by_epoch=True)]
+
+# train, val, test setting
+train_cfg = dict(by_epoch=True, max_epochs=600, val_interval=1)
+
+# runtime setting
+custom_hooks = [dict(type='EMAHook', momentum=1e-4, priority='ABOVE_NORMAL')]
diff --git a/configs/convnext_v2/convnext-v2-tiny_32xb32_in1k-384px.py b/configs/convnext_v2/convnext-v2-tiny_32xb32_in1k-384px.py
new file mode 100644
index 0000000000000000000000000000000000000000..a19fd6cc670c33726187d41cef41ff33e69d8edd
--- /dev/null
+++ b/configs/convnext_v2/convnext-v2-tiny_32xb32_in1k-384px.py
@@ -0,0 +1,35 @@
+_base_ = [
+ '../_base_/models/convnext_v2/tiny.py',
+ '../_base_/datasets/imagenet_bs64_swin_384.py',
+ '../_base_/schedules/imagenet_bs1024_adamw_swin.py',
+ '../_base_/default_runtime.py',
+]
+
+# dataset setting
+train_dataloader = dict(batch_size=32)
+
+# schedule setting
+optim_wrapper = dict(
+ optimizer=dict(lr=3.2e-3),
+ clip_grad=None,
+)
+
+# learning policy
+param_scheduler = [
+ # warm up learning rate scheduler
+ dict(
+ type='LinearLR',
+ start_factor=1e-3,
+ by_epoch=True,
+ end=40,
+ # update by iter
+ convert_to_iter_based=True),
+ # main learning rate scheduler
+ dict(type='CosineAnnealingLR', eta_min=1e-5, by_epoch=True, begin=40)
+]
+
+# train, val, test setting
+train_cfg = dict(by_epoch=True, max_epochs=300, val_interval=1)
+
+# runtime setting
+custom_hooks = [dict(type='EMAHook', momentum=1e-4, priority='ABOVE_NORMAL')]
diff --git a/configs/convnext_v2/convnext-v2-tiny_32xb32_in1k.py b/configs/convnext_v2/convnext-v2-tiny_32xb32_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..c6fbd0f2cd4189fb1699959cf8d63228a1ab3515
--- /dev/null
+++ b/configs/convnext_v2/convnext-v2-tiny_32xb32_in1k.py
@@ -0,0 +1,35 @@
+_base_ = [
+ '../_base_/models/convnext_v2/tiny.py',
+ '../_base_/datasets/imagenet_bs64_swin_224.py',
+ '../_base_/schedules/imagenet_bs1024_adamw_swin.py',
+ '../_base_/default_runtime.py',
+]
+
+# dataset setting
+train_dataloader = dict(batch_size=32)
+
+# schedule setting
+optim_wrapper = dict(
+ optimizer=dict(lr=3.2e-3),
+ clip_grad=None,
+)
+
+# learning policy
+param_scheduler = [
+ # warm up learning rate scheduler
+ dict(
+ type='LinearLR',
+ start_factor=1e-3,
+ by_epoch=True,
+ end=40,
+ # update by iter
+ convert_to_iter_based=True),
+ # main learning rate scheduler
+ dict(type='CosineAnnealingLR', eta_min=1e-5, by_epoch=True, begin=40)
+]
+
+# train, val, test setting
+train_cfg = dict(by_epoch=True, max_epochs=300, val_interval=1)
+
+# runtime setting
+custom_hooks = [dict(type='EMAHook', momentum=1e-4, priority='ABOVE_NORMAL')]
diff --git a/configs/convnext_v2/metafile.yml b/configs/convnext_v2/metafile.yml
new file mode 100644
index 0000000000000000000000000000000000000000..86baa586ec6824603351cc70348c219f68fa71a2
--- /dev/null
+++ b/configs/convnext_v2/metafile.yml
@@ -0,0 +1,433 @@
+Collections:
+ - Name: ConvNeXt V2
+ Metadata:
+ Architecture:
+ - Global Response Normalization
+ Paper:
+ Title: 'ConvNeXt V2: Co-designing and Scaling ConvNets with Masked Autoencoders'
+ URL: http://arxiv.org/abs/2301.00808
+ README: configs/convnext_v2/README.md
+
+Models:
+ - Name: convnext-v2-atto_3rdparty-fcmae_in1k
+ Metadata:
+ Training Data: ImageNet-1k
+ FLOPs: 551718080
+ Parameters: 3708400
+ In Collection: ConvNeXt V2
+ Results: null
+ Weights: https://download.openmmlab.com/mmclassification/v0/convnext-v2/convnext-v2-atto_3rdparty-fcmae_in1k_20230104-07514db4.pth
+ Config: configs/convnext_v2/convnext-v2-atto_32xb32_in1k.py
+ Converted From:
+ Weights: https://dl.fbaipublicfiles.com/convnext/convnextv2/pt_only/convnextv2_atto_1k_224_fcmae.pt
+ Code: https://github.com/facebookresearch/ConvNeXt-V2
+ - Name: convnext-v2-atto_fcmae-pre_3rdparty_in1k
+ Metadata:
+ Training Data: ImageNet-1k
+ FLOPs: 551718080
+ Parameters: 3708400
+ In Collection: ConvNeXt V2
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 76.64
+ Top 5 Accuracy: 93.04
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/convnext-v2/convnext-v2-atto_fcmae-pre_3rdparty_in1k_20230104-23765f83.pth
+ Config: configs/convnext_v2/convnext-v2-atto_32xb32_in1k.py
+ Converted From:
+ Weights: https://dl.fbaipublicfiles.com/convnext/convnextv2/im1k/convnextv2_atto_1k_224_ema.pt
+ Code: https://github.com/facebookresearch/ConvNeXt-V2
+ - Name: convnext-v2-femto_3rdparty-fcmae_in1k
+ Metadata:
+ Training Data: ImageNet-1k
+ FLOPs: 784892544
+ Parameters: 5233240
+ In Collection: ConvNeXt V2
+ Results: null
+ Weights: https://download.openmmlab.com/mmclassification/v0/convnext-v2/convnext-v2-femto_3rdparty-fcmae_in1k_20230104-adbe2082.pth
+ Config: configs/convnext_v2/convnext-v2-femto_32xb32_in1k.py
+ Converted From:
+ Weights: https://dl.fbaipublicfiles.com/convnext/convnextv2/pt_only/convnextv2_femto_1k_224_fcmae.pt
+ Code: https://github.com/facebookresearch/ConvNeXt-V2
+ - Name: convnext-v2-femto_fcmae-pre_3rdparty_in1k
+ Metadata:
+ Training Data: ImageNet-1k
+ FLOPs: 784892544
+ Parameters: 5233240
+ In Collection: ConvNeXt V2
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 78.48
+ Top 5 Accuracy: 93.98
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/convnext-v2/convnext-v2-femto_fcmae-pre_3rdparty_in1k_20230104-92a75d75.pth
+ Config: configs/convnext_v2/convnext-v2-femto_32xb32_in1k.py
+ Converted From:
+ Weights: https://dl.fbaipublicfiles.com/convnext/convnextv2/im1k/convnextv2_femto_1k_224_ema.pt
+ Code: https://github.com/facebookresearch/ConvNeXt-V2
+ - Name: convnext-v2-pico_3rdparty-fcmae_in1k
+ Metadata:
+ Training Data: ImageNet-1k
+ FLOPs: 1374072320
+ Parameters: 9066280
+ In Collection: ConvNeXt V2
+ Results: null
+ Weights: https://download.openmmlab.com/mmclassification/v0/convnext-v2/convnext-v2-pico_3rdparty-fcmae_in1k_20230104-147b1b59.pth
+ Config: configs/convnext_v2/convnext-v2-pico_32xb32_in1k.py
+ Converted From:
+ Weights: https://dl.fbaipublicfiles.com/convnext/convnextv2/pt_only/convnextv2_pico_1k_224_fcmae.pt
+ Code: https://github.com/facebookresearch/ConvNeXt-V2
+ - Name: convnext-v2-pico_fcmae-pre_3rdparty_in1k
+ Metadata:
+ Training Data: ImageNet-1k
+ FLOPs: 1374072320
+ Parameters: 9066280
+ In Collection: ConvNeXt V2
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 80.31
+ Top 5 Accuracy: 95.08
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/convnext-v2/convnext-v2-pico_fcmae-pre_3rdparty_in1k_20230104-d20263ca.pth
+ Config: configs/convnext_v2/convnext-v2-pico_32xb32_in1k.py
+ Converted From:
+ Weights: https://dl.fbaipublicfiles.com/convnext/convnextv2/im1k/convnextv2_pico_1k_224_ema.pt
+ Code: https://github.com/facebookresearch/ConvNeXt-V2
+ - Name: convnext-v2-nano_3rdparty-fcmae_in1k
+ Metadata:
+ Training Data: ImageNet-1k
+ FLOPs: 2454926720
+ Parameters: 15623800
+ In Collection: ConvNeXt V2
+ Results: null
+ Weights: https://download.openmmlab.com/mmclassification/v0/convnext-v2/convnext-v2-nano_3rdparty-fcmae_in1k_20230104-3dd1f29e.pth
+ Config: configs/convnext_v2/convnext-v2-nano_32xb32_in1k.py
+ Converted From:
+ Weights: https://dl.fbaipublicfiles.com/convnext/convnextv2/pt_only/convnextv2_nano_1k_224_fcmae.pt
+ Code: https://github.com/facebookresearch/ConvNeXt-V2
+ - Name: convnext-v2-nano_fcmae-pre_3rdparty_in1k
+ Metadata:
+ Training Data: ImageNet-1k
+ FLOPs: 2454926720
+ Parameters: 15623800
+ In Collection: ConvNeXt V2
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 81.86
+ Top 5 Accuracy: 95.75
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/convnext-v2/convnext-v2-nano_fcmae-pre_3rdparty_in1k_20230104-fe1aaaf2.pth
+ Config: configs/convnext_v2/convnext-v2-nano_32xb32_in1k.py
+ Converted From:
+ Weights: https://dl.fbaipublicfiles.com/convnext/convnextv2/im1k/convnextv2_nano_1k_224_ema.pt
+ Code: https://github.com/facebookresearch/ConvNeXt-V2
+ - Name: convnext-v2-nano_fcmae-in21k-pre_3rdparty_in1k
+ Metadata:
+ Training Data:
+ - ImageNet-21k
+ - ImageNet-1k
+ FLOPs: 2454926720
+ Parameters: 15623800
+ In Collection: ConvNeXt V2
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 82.04
+ Top 5 Accuracy: 96.16
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/convnext-v2/convnext-v2-nano_fcmae-in21k-pre_3rdparty_in1k_20230104-91fa8ae2.pth
+ Config: configs/convnext_v2/convnext-v2-nano_32xb32_in1k.py
+ Converted From:
+ Weights: https://dl.fbaipublicfiles.com/convnext/convnextv2/im22k/convnextv2_nano_22k_224_ema.pt
+ Code: https://github.com/facebookresearch/ConvNeXt-V2
+ - Name: convnext-v2-tiny_3rdparty-fcmae_in1k
+ Metadata:
+ Training Data: ImageNet-1k
+ FLOPs: 4469631744
+ Parameters: 28635496
+ In Collection: ConvNeXt V2
+ Results: null
+ Weights: https://download.openmmlab.com/mmclassification/v0/convnext-v2/convnext-v2-tiny_3rdparty-fcmae_in1k_20230104-80513adc.pth
+ Config: configs/convnext_v2/convnext-v2-tiny_32xb32_in1k.py
+ Converted From:
+ Weights: https://dl.fbaipublicfiles.com/convnext/convnextv2/pt_only/convnextv2_tiny_1k_224_fcmae.pt
+ Code: https://github.com/facebookresearch/ConvNeXt-V2
+ - Name: convnext-v2-tiny_fcmae-pre_3rdparty_in1k
+ Metadata:
+ Training Data: ImageNet-1k
+ FLOPs: 4469631744
+ Parameters: 28635496
+ In Collection: ConvNeXt V2
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 82.94
+ Top 5 Accuracy: 96.29
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/convnext-v2/convnext-v2-tiny_fcmae-pre_3rdparty_in1k_20230104-471a86de.pth
+ Config: configs/convnext_v2/convnext-v2-tiny_32xb32_in1k.py
+ Converted From:
+ Weights: https://dl.fbaipublicfiles.com/convnext/convnextv2/im1k/convnextv2_tiny_1k_224_ema.pt
+ Code: https://github.com/facebookresearch/ConvNeXt-V2
+ - Name: convnext-v2-tiny_fcmae-in21k-pre_3rdparty_in1k
+ Metadata:
+ Training Data:
+ - ImageNet-21k
+ - ImageNet-1k
+ FLOPs: 4469631744
+ Parameters: 28635496
+ In Collection: ConvNeXt V2
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 83.89
+ Top 5 Accuracy: 96.96
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/convnext-v2/convnext-v2-tiny_fcmae-in21k-pre_3rdparty_in1k_20230104-8cc8b8f2.pth
+ Config: configs/convnext_v2/convnext-v2-tiny_32xb32_in1k.py
+ Converted From:
+ Weights: https://dl.fbaipublicfiles.com/convnext/convnextv2/im22k/convnextv2_tiny_22k_224_ema.pt
+ Code: https://github.com/facebookresearch/ConvNeXt-V2
+ - Name: convnext-v2-nano_fcmae-in21k-pre_3rdparty_in1k-384px
+ Metadata:
+ Training Data:
+ - ImageNet-21k
+ - ImageNet-1k
+ FLOPs: 7214472320
+ Parameters: 15623800
+ In Collection: ConvNeXt V2
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 83.36
+ Top 5 Accuracy: 96.75
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/convnext-v2/convnext-v2-nano_fcmae-in21k-pre_3rdparty_in1k-384px_20230104-f951ae87.pth
+ Config: configs/convnext_v2/convnext-v2-nano_32xb32_in1k-384px.py
+ Converted From:
+ Weights: https://dl.fbaipublicfiles.com/convnext/convnextv2/im22k/convnextv2_nano_22k_384_ema.pt
+ Code: https://github.com/facebookresearch/ConvNeXt-V2
+ - Name: convnext-v2-tiny_fcmae-in21k-pre_3rdparty_in1k-384px
+ Metadata:
+ Training Data:
+ - ImageNet-21k
+ - ImageNet-1k
+ FLOPs: 13135236864
+ Parameters: 28635496
+ In Collection: ConvNeXt V2
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 85.09
+ Top 5 Accuracy: 97.63
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/convnext-v2/convnext-v2-tiny_fcmae-in21k-pre_3rdparty_in1k-384px_20230104-d8579f84.pth
+ Config: configs/convnext_v2/convnext-v2-tiny_32xb32_in1k-384px.py
+ Converted From:
+ Weights: https://dl.fbaipublicfiles.com/convnext/convnextv2/im22k/convnextv2_tiny_22k_384_ema.pt
+ Code: https://github.com/facebookresearch/ConvNeXt-V2
+ - Name: convnext-v2-base_3rdparty-fcmae_in1k
+ Metadata:
+ Training Data: ImageNet-1k
+ FLOPs: 15382561792
+ Parameters: 88717800
+ In Collection: ConvNeXt V2
+ Results: null
+ Weights: https://download.openmmlab.com/mmclassification/v0/convnext-v2/convnext-v2-base_3rdparty-fcmae_in1k_20230104-8a798eaf.pth
+ Config: configs/convnext_v2/convnext-v2-base_32xb32_in1k.py
+ Converted From:
+ Weights: https://dl.fbaipublicfiles.com/convnext/convnextv2/pt_only/convnextv2_base_1k_224_fcmae.pt
+ Code: https://github.com/facebookresearch/ConvNeXt-V2
+ - Name: convnext-v2-base_fcmae-pre_3rdparty_in1k
+ Metadata:
+ Training Data: ImageNet-1k
+ FLOPs: 15382561792
+ Parameters: 88717800
+ In Collection: ConvNeXt V2
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 84.87
+ Top 5 Accuracy: 97.08
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/convnext-v2/convnext-v2-base_fcmae-pre_3rdparty_in1k_20230104-00a70fa4.pth
+ Config: configs/convnext_v2/convnext-v2-base_32xb32_in1k.py
+ Converted From:
+ Weights: https://dl.fbaipublicfiles.com/convnext/convnextv2/im1k/convnextv2_base_1k_224_ema.pt
+ Code: https://github.com/facebookresearch/ConvNeXt-V2
+ - Name: convnext-v2-base_fcmae-in21k-pre_3rdparty_in1k
+ Metadata:
+ Training Data:
+ - ImageNet-21k
+ - ImageNet-1k
+ FLOPs: 15382561792
+ Parameters: 88717800
+ In Collection: ConvNeXt V2
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 86.74
+ Top 5 Accuracy: 98.02
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/convnext-v2/convnext-v2-base_fcmae-in21k-pre_3rdparty_in1k_20230104-c48d16a5.pth
+ Config: configs/convnext_v2/convnext-v2-base_32xb32_in1k.py
+ Converted From:
+ Weights: https://dl.fbaipublicfiles.com/convnext/convnextv2/im22k/convnextv2_base_22k_224_ema.pt
+ Code: https://github.com/facebookresearch/ConvNeXt-V2
+ - Name: convnext-v2-large_3rdparty-fcmae_in1k
+ Metadata:
+ Training Data: ImageNet-1k
+ FLOPs: 34403182080
+ Parameters: 197956840
+ In Collection: ConvNeXt V2
+ Results: null
+ Weights: https://download.openmmlab.com/mmclassification/v0/convnext-v2/convnext-v2-large_3rdparty-fcmae_in1k_20230104-bf38df92.pth
+ Config: configs/convnext_v2/convnext-v2-large_32xb32_in1k.py
+ Converted From:
+ Weights: https://dl.fbaipublicfiles.com/convnext/convnextv2/pt_only/convnextv2_large_1k_224_fcmae.pt
+ Code: https://github.com/facebookresearch/ConvNeXt-V2
+ - Name: convnext-v2-large_fcmae-pre_3rdparty_in1k
+ Metadata:
+ Training Data: ImageNet-1k
+ FLOPs: 34403182080
+ Parameters: 197956840
+ In Collection: ConvNeXt V2
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 85.76
+ Top 5 Accuracy: 97.59
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/convnext-v2/convnext-v2-large_fcmae-pre_3rdparty_in1k_20230104-ef393013.pth
+ Config: configs/convnext_v2/convnext-v2-large_32xb32_in1k.py
+ Converted From:
+ Weights: https://dl.fbaipublicfiles.com/convnext/convnextv2/im1k/convnextv2_large_1k_224_ema.pt
+ Code: https://github.com/facebookresearch/ConvNeXt-V2
+ - Name: convnext-v2-large_fcmae-in21k-pre_3rdparty_in1k
+ Metadata:
+ Training Data:
+ - ImageNet-21k
+ - ImageNet-1k
+ FLOPs: 34403182080
+ Parameters: 197956840
+ In Collection: ConvNeXt V2
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 87.26
+ Top 5 Accuracy: 98.24
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/convnext-v2/convnext-v2-large_fcmae-in21k-pre_3rdparty_in1k_20230104-d9c4dc0c.pth
+ Config: configs/convnext_v2/convnext-v2-large_32xb32_in1k.py
+ Converted From:
+ Weights: https://dl.fbaipublicfiles.com/convnext/convnextv2/im22k/convnextv2_large_22k_224_ema.pt
+ Code: https://github.com/facebookresearch/ConvNeXt-V2
+ - Name: convnext-v2-base_fcmae-in21k-pre_3rdparty_in1k-384px
+ Metadata:
+ Training Data:
+ - ImageNet-21k
+ - ImageNet-1k
+ FLOPs: 45205885952
+ Parameters: 88717800
+ In Collection: ConvNeXt V2
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 87.63
+ Top 5 Accuracy: 98.42
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/convnext-v2/convnext-v2-base_fcmae-in21k-pre_3rdparty_in1k-384px_20230104-379425cc.pth
+ Config: configs/convnext_v2/convnext-v2-base_32xb32_in1k-384px.py
+ Converted From:
+ Weights: https://dl.fbaipublicfiles.com/convnext/convnextv2/im22k/convnextv2_base_22k_384_ema.pt
+ Code: https://github.com/facebookresearch/ConvNeXt-V2
+ - Name: convnext-v2-large_fcmae-in21k-pre_3rdparty_in1k-384px
+ Metadata:
+ Training Data:
+ - ImageNet-21k
+ - ImageNet-1k
+ FLOPs: 101103214080
+ Parameters: 197956840
+ In Collection: ConvNeXt V2
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 88.18
+ Top 5 Accuracy: 98.52
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/convnext-v2/convnext-v2-large_fcmae-in21k-pre_3rdparty_in1k-384px_20230104-9139a1f3.pth
+ Config: configs/convnext_v2/convnext-v2-large_32xb32_in1k-384px.py
+ Converted From:
+ Weights: https://dl.fbaipublicfiles.com/convnext/convnextv2/im22k/convnextv2_large_22k_384_ema.pt
+ Code: https://github.com/facebookresearch/ConvNeXt-V2
+ - Name: convnext-v2-huge_3rdparty-fcmae_in1k
+ Metadata:
+ Training Data: ImageNet-1k
+ FLOPs: 114998639360
+ Parameters: 660289640
+ In Collection: ConvNeXt V2
+ Results: null
+ Weights: https://download.openmmlab.com/mmclassification/v0/convnext-v2/convnext-v2-huge_3rdparty-fcmae_in1k_20230104-fe43ae6c.pth
+ Config: configs/convnext_v2/convnext-v2-huge_32xb32_in1k.py
+ Converted From:
+ Weights: https://dl.fbaipublicfiles.com/convnext/convnextv2/pt_only/convnextv2_huge_1k_224_fcmae.pt
+ Code: https://github.com/facebookresearch/ConvNeXt-V2
+ - Name: convnext-v2-huge_fcmae-pre_3rdparty_in1k
+ Metadata:
+ Training Data: ImageNet-1k
+ FLOPs: 114998639360
+ Parameters: 660289640
+ In Collection: ConvNeXt V2
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 86.25
+ Top 5 Accuracy: 97.75
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/convnext-v2/convnext-v2-huge_fcmae-pre_3rdparty_in1k_20230104-f795e5b8.pth
+ Config: configs/convnext_v2/convnext-v2-huge_32xb32_in1k.py
+ Converted From:
+ Weights: https://dl.fbaipublicfiles.com/convnext/convnextv2/im1k/convnextv2_huge_1k_224_ema.pt
+ Code: https://github.com/facebookresearch/ConvNeXt-V2
+ - Name: convnext-v2-huge_fcmae-in21k-pre_3rdparty_in1k-384px
+ Metadata:
+ Training Data:
+ - ImageNet-21k
+ - ImageNet-1k
+ FLOPs: 337955157760
+ Parameters: 660289640
+ In Collection: ConvNeXt V2
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 88.68
+ Top 5 Accuracy: 98.73
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/convnext-v2/convnext-v2-huge_fcmae-in21k-pre_3rdparty_in1k-384px_20230104-02a4eb35.pth
+ Config: configs/convnext_v2/convnext-v2-huge_32xb32_in1k-384px.py
+ Converted From:
+ Weights: https://dl.fbaipublicfiles.com/convnext/convnextv2/im22k/convnextv2_huge_22k_384_ema.pt
+ Code: https://github.com/facebookresearch/ConvNeXt-V2
+ - Name: convnext-v2-huge_fcmae-in21k-pre_3rdparty_in1k-512px
+ Metadata:
+ Training Data:
+ - ImageNet-21k
+ - ImageNet-1k
+ FLOPs: 600809158400
+ Parameters: 660289640
+ In Collection: ConvNeXt V2
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 88.86
+ Top 5 Accuracy: 98.74
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/convnext-v2/convnext-v2-huge_fcmae-in21k-pre_3rdparty_in1k-512px_20230104-ce32e63c.pth
+ Config: configs/convnext_v2/convnext-v2-huge_32xb32_in1k-512px.py
+ Converted From:
+ Weights: https://dl.fbaipublicfiles.com/convnext/convnextv2/im22k/convnextv2_huge_22k_512_ema.pt
+ Code: https://github.com/facebookresearch/ConvNeXt-V2
diff --git a/configs/cspnet/README.md b/configs/cspnet/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..f3b145ba0399b660d03233d9deb11913fbc3c438
--- /dev/null
+++ b/configs/cspnet/README.md
@@ -0,0 +1,78 @@
+# CSPNet
+
+> [CSPNet: A New Backbone that can Enhance Learning Capability of CNN](https://arxiv.org/abs/1911.11929)
+
+
+
+## Abstract
+
+Neural networks have enabled state-of-the-art approaches to achieve incredible results on computer vision tasks such as object detection. However, such success greatly relies on costly computation resources, which hinders people with cheap devices from appreciating the advanced technology. In this paper, we propose Cross Stage Partial Network (CSPNet) to mitigate the problem that previous works require heavy inference computations from the network architecture perspective. We attribute the problem to the duplicate gradient information within network optimization. The proposed networks respect the variability of the gradients by integrating feature maps from the beginning and the end of a network stage, which, in our experiments, reduces computations by 20% with equivalent or even superior accuracy on the ImageNet dataset, and significantly outperforms state-of-the-art approaches in terms of AP50 on the MS COCO object detection dataset. The CSPNet is easy to implement and general enough to cope with architectures based on ResNet, ResNeXt, and DenseNet. Source code is at this https URL.
+
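+The cross-stage partial idea described above can be illustrated with a minimal
+sketch (an assumption-level illustration only, not MMPreTrain's `CSPDarkNet` /
+`CSPResNet` implementation, which uses convolutional split and transition
+layers): the stage input is split along the channel dimension, only one part is
+processed by the stage's blocks, and the two parts are concatenated at the end.
+
+```python
+import torch
+import torch.nn as nn
+
+
+class CSPStageSketch(nn.Module):
+    """Conceptual cross-stage partial stage (illustrative only)."""
+
+    def __init__(self, channels: int, blocks: nn.Module):
+        super().__init__()
+        self.blocks = blocks        # e.g. a stack of residual blocks
+        self.split = channels // 2  # half of the channels bypass the blocks
+
+    def forward(self, x: torch.Tensor) -> torch.Tensor:
+        part1 = x[:, :self.split]   # untouched branch
+        part2 = x[:, self.split:]   # branch processed by the stage blocks
+        return torch.cat([part1, self.blocks(part2)], dim=1)
+
+
+stage = CSPStageSketch(64, nn.Sequential(nn.Conv2d(32, 32, 3, padding=1), nn.ReLU()))
+print(stage(torch.rand(1, 64, 56, 56)).shape)  # torch.Size([1, 64, 56, 56])
+```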
+
+
+
+
+## How to use it?
+
+
+
+**Predict image**
+
+```python
+from mmpretrain import inference_model
+
+predict = inference_model('cspdarknet50_3rdparty_8xb32_in1k', 'demo/bird.JPEG')
+print(predict['pred_class'])
+print(predict['pred_score'])
+```
+
+**Use the model**
+
+```python
+import torch
+from mmpretrain import get_model
+
+model = get_model('cspdarknet50_3rdparty_8xb32_in1k', pretrained=True)
+inputs = torch.rand(1, 3, 224, 224)
+out = model(inputs)
+print(type(out))
+# To extract features.
+feats = model.extract_feat(inputs)
+print(type(feats))
+```
+
+**Test Command**
+
+Prepare your dataset according to the [docs](https://mmpretrain.readthedocs.io/en/latest/user_guides/dataset_prepare.html#prepare-dataset).
+
+Test:
+
+```shell
+python tools/test.py configs/cspnet/cspdarknet50_8xb32_in1k.py https://download.openmmlab.com/mmclassification/v0/cspnet/cspdarknet50_3rdparty_8xb32_in1k_20220329-bd275287.pth
+```
+
+
+
+## Models and results
+
+### Image Classification on ImageNet-1k
+
+| Model | Pretrain | Params (M) | Flops (G) | Top-1 (%) | Top-5 (%) | Config | Download |
+| :----------------------------------- | :----------: | :--------: | :-------: | :-------: | :-------: | :----------------------------------: | :-----------------------------------------------------------------------------: |
+| `cspdarknet50_3rdparty_8xb32_in1k`\* | From scratch | 27.64 | 5.04 | 80.05 | 95.07 | [config](cspdarknet50_8xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/cspnet/cspdarknet50_3rdparty_8xb32_in1k_20220329-bd275287.pth) |
+| `cspresnet50_3rdparty_8xb32_in1k`\* | From scratch | 21.62 | 3.48 | 79.55 | 94.68 | [config](cspresnet50_8xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/cspnet/cspresnet50_3rdparty_8xb32_in1k_20220329-dd6dddfb.pth) |
+| `cspresnext50_3rdparty_8xb32_in1k`\* | From scratch | 20.57 | 3.11 | 79.96 | 94.96 | [config](cspresnext50_8xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/cspnet/cspresnext50_3rdparty_8xb32_in1k_20220329-2cc84d21.pth) |
+
+*Models with * are converted from the [official repo](https://github.com/rwightman/pytorch-image-models). The config files of these models are only for inference. We haven't reproduced the training results.*
+
+## Citation
+
+```bibtex
+@inproceedings{wang2020cspnet,
+ title={CSPNet: A new backbone that can enhance learning capability of CNN},
+ author={Wang, Chien-Yao and Liao, Hong-Yuan Mark and Wu, Yueh-Hua and Chen, Ping-Yang and Hsieh, Jun-Wei and Yeh, I-Hau},
+ booktitle={Proceedings of the IEEE/CVF conference on computer vision and pattern recognition workshops},
+ pages={390--391},
+ year={2020}
+}
+```
diff --git a/configs/cspnet/cspdarknet50_8xb32_in1k.py b/configs/cspnet/cspdarknet50_8xb32_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..851148109e72202cd5eca721fb66023ab2934e90
--- /dev/null
+++ b/configs/cspnet/cspdarknet50_8xb32_in1k.py
@@ -0,0 +1,45 @@
+_base_ = [
+ '../_base_/datasets/imagenet_bs32.py',
+ '../_base_/schedules/imagenet_bs256.py',
+ '../_base_/default_runtime.py',
+]
+
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(type='CSPDarkNet', depth=53),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=1024,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ ))
+
+# dataset settings
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='RandomResizedCrop',
+ scale=224,
+ backend='pillow',
+ interpolation='bicubic'),
+ dict(type='RandomFlip', prob=0.5, direction='horizontal'),
+ dict(type='PackInputs'),
+]
+
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='ResizeEdge',
+ scale=288,
+ edge='short',
+ backend='pillow',
+ interpolation='bicubic'),
+ dict(type='CenterCrop', crop_size=256),
+ dict(type='PackInputs'),
+]
+
+train_dataloader = dict(dataset=dict(pipeline=train_pipeline))
+val_dataloader = dict(dataset=dict(pipeline=test_pipeline))
+test_dataloader = dict(dataset=dict(pipeline=test_pipeline))
diff --git a/configs/cspnet/cspresnet50_8xb32_in1k.py b/configs/cspnet/cspresnet50_8xb32_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..d149637aabae7b8cdf691262796becc4cfcc5efc
--- /dev/null
+++ b/configs/cspnet/cspresnet50_8xb32_in1k.py
@@ -0,0 +1,45 @@
+_base_ = [
+ '../_base_/datasets/imagenet_bs32.py',
+ '../_base_/schedules/imagenet_bs256.py',
+ '../_base_/default_runtime.py',
+]
+
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(type='CSPResNet', depth=50),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=1024,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ ))
+
+# dataset settings
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='RandomResizedCrop',
+ scale=224,
+ backend='pillow',
+ interpolation='bicubic'),
+ dict(type='RandomFlip', prob=0.5, direction='horizontal'),
+ dict(type='PackInputs'),
+]
+
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='ResizeEdge',
+ scale=288,
+ edge='short',
+ backend='pillow',
+ interpolation='bicubic'),
+ dict(type='CenterCrop', crop_size=256),
+ dict(type='PackInputs'),
+]
+
+train_dataloader = dict(dataset=dict(pipeline=train_pipeline))
+val_dataloader = dict(dataset=dict(pipeline=test_pipeline))
+test_dataloader = dict(dataset=dict(pipeline=test_pipeline))
diff --git a/configs/cspnet/cspresnext50_8xb32_in1k.py b/configs/cspnet/cspresnext50_8xb32_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..1f8c15c12f6ab42349eda2a3680f07eabb855448
--- /dev/null
+++ b/configs/cspnet/cspresnext50_8xb32_in1k.py
@@ -0,0 +1,45 @@
+_base_ = [
+ '../_base_/datasets/imagenet_bs32.py',
+ '../_base_/schedules/imagenet_bs256.py',
+ '../_base_/default_runtime.py',
+]
+
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(type='CSPResNeXt', depth=50),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=2048,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ ))
+
+# dataset settings
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='RandomResizedCrop',
+ scale=224,
+ backend='pillow',
+ interpolation='bicubic'),
+ dict(type='RandomFlip', prob=0.5, direction='horizontal'),
+ dict(type='PackInputs'),
+]
+
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='ResizeEdge',
+ scale=256,
+ edge='short',
+ backend='pillow',
+ interpolation='bicubic'),
+ dict(type='CenterCrop', crop_size=224),
+ dict(type='PackInputs'),
+]
+
+train_dataloader = dict(dataset=dict(pipeline=train_pipeline))
+val_dataloader = dict(dataset=dict(pipeline=test_pipeline))
+test_dataloader = dict(dataset=dict(pipeline=test_pipeline))
diff --git a/configs/cspnet/metafile.yml b/configs/cspnet/metafile.yml
new file mode 100644
index 0000000000000000000000000000000000000000..31036325f6e9c96c574a303f60990e28fe7822b9
--- /dev/null
+++ b/configs/cspnet/metafile.yml
@@ -0,0 +1,64 @@
+Collections:
+ - Name: CSPNet
+ Metadata:
+ Training Data: ImageNet-1k
+ Architecture:
+ - Cross Stage Partial Stage
+ Paper:
+ URL: https://arxiv.org/abs/1911.11929
+ Title: 'CSPNet: A New Backbone that can Enhance Learning Capability of CNN'
+ README: configs/cspnet/README.md
+ Code:
+ Version: v0.22.0
+ URL: https://github.com/open-mmlab/mmpretrain/blob/v0.22.0/mmcls/models/backbones/cspnet.py
+
+Models:
+ - Name: cspdarknet50_3rdparty_8xb32_in1k
+ Metadata:
+ FLOPs: 5040000000
+ Parameters: 27640000
+ In Collection: CSPNet
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 80.05
+ Top 5 Accuracy: 95.07
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/cspnet/cspdarknet50_3rdparty_8xb32_in1k_20220329-bd275287.pth
+ Config: configs/cspnet/cspdarknet50_8xb32_in1k.py
+ Converted From:
+ Weights: https://github.com/rwightman/pytorch-image-models/releases/download/v0.1-weights/cspdarknet53_ra_256-d05c7c21.pth
+ Code: https://github.com/rwightman/pytorch-image-models
+ - Name: cspresnet50_3rdparty_8xb32_in1k
+ Metadata:
+ Training Data: ImageNet-1k
+ FLOPs: 3480000000
+ Parameters: 21620000
+ In Collection: CSPNet
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 79.55
+ Top 5 Accuracy: 94.68
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/cspnet/cspresnet50_3rdparty_8xb32_in1k_20220329-dd6dddfb.pth
+ Config: configs/cspnet/cspresnet50_8xb32_in1k.py
+ Converted From:
+ Weights: https://github.com/rwightman/pytorch-image-models/releases/download/v0.1-weights/cspresnet50_ra-d3e8d487.pth
+ Code: https://github.com/rwightman/pytorch-image-models
+ - Name: cspresnext50_3rdparty_8xb32_in1k
+ Metadata:
+ FLOPs: 3110000000
+ Parameters: 20570000
+ In Collection: CSPNet
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 79.96
+ Top 5 Accuracy: 94.96
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/cspnet/cspresnext50_3rdparty_8xb32_in1k_20220329-2cc84d21.pth
+ Config: configs/cspnet/cspresnext50_8xb32_in1k.py
+ Converted From:
+ Weights: https://github.com/rwightman/pytorch-image-models/releases/download/v0.1-weights/cspresnext50_ra_224-648b4713.pth
+ Code: https://github.com/rwightman/pytorch-image-models
diff --git a/configs/csra/README.md b/configs/csra/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..99b29571c9e602d501518c0fdfcd490cee83f183
--- /dev/null
+++ b/configs/csra/README.md
@@ -0,0 +1,73 @@
+# CSRA
+
+> [Residual Attention: A Simple but Effective Method for Multi-Label Recognition](https://arxiv.org/abs/2108.02456)
+
+
+
+## Abstract
+
+Multi-label image recognition is a challenging computer vision task of practical use. Progresses in this area, however, are often characterized by complicated methods, heavy computations, and lack of intuitive explanations. To effectively capture different spatial regions occupied by objects from different categories, we propose an embarrassingly simple module, named class-specific residual attention (CSRA). CSRA generates class-specific features for every category by proposing a simple spatial attention score, and then combines it with the class-agnostic average pooling feature. CSRA achieves state-of-the-art results on multilabel recognition, and at the same time is much simpler than them. Furthermore, with only 4 lines of code, CSRA also leads to consistent improvement across many diverse pretrained models and datasets without any extra training. CSRA is both easy to implement and light in computations, which also enjoys intuitive explanations and visualizations.
+
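+The class-specific residual attention described above can be sketched in a few
+lines (a hedged illustration of the idea, not MMPreTrain's `CSRAClsHead`; the
+`lam` and temperature `T` values below are assumptions for the illustration):
+
+```python
+import torch
+
+
+def csra_sketch(feat, fc_weight, lam=0.1, T=99.0):
+    # feat: (B, C, H, W) backbone feature map; fc_weight: (K, C) classifier weights.
+    score = torch.einsum('bchw,kc->bkhw', feat, fc_weight).flatten(2)  # (B, K, HW)
+    base = score.mean(dim=-1)                # class-agnostic average pooling
+    attn = torch.softmax(score * T, dim=-1)  # class-specific spatial attention
+    residual = (attn * score).sum(dim=-1)    # attention-weighted spatial pooling
+    return base + lam * residual             # residual combination per class
+
+
+logits = csra_sketch(torch.rand(2, 2048, 14, 14), torch.rand(20, 2048))
+print(logits.shape)  # torch.Size([2, 20])
+```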
+
+
+
+
+## How to use it?
+
+
+
+**Use the model**
+
+```python
+import torch
+from mmpretrain import get_model
+
+model = get_model('resnet101-csra_1xb16_voc07-448px', pretrained=True)
+inputs = torch.rand(1, 3, 224, 224)
+out = model(inputs)
+print(type(out))
+# To extract features.
+feats = model.extract_feat(inputs)
+print(type(feats))
+```
+
+**Train/Test Command**
+
+Prepare your dataset according to the [docs](https://mmpretrain.readthedocs.io/en/latest/user_guides/dataset_prepare.html#prepare-dataset).
+
+Train:
+
+```shell
+python tools/train.py configs/csra/resnet101-csra_1xb16_voc07-448px.py
+```
+
+Test:
+
+```shell
+python tools/test.py configs/csra/resnet101-csra_1xb16_voc07-448px.py https://download.openmmlab.com/mmclassification/v0/csra/resnet101-csra_1xb16_voc07-448px_20220722-29efb40a.pth
+```
+
+
+
+## Models and results
+
+### Multi-Label Classification on PASCAL VOC 2007
+
+| Model | Pretrain | Params (M) | Flops (G) | CF1 | OF1 | mAP | Config | Download |
+| :--------------------------------- | :----------: | :--------: | :-------: | :---: | :---: | :---: | :-------------------------------------------: | :-------------------------------------------------------------------------: |
+| `resnet101-csra_1xb16_voc07-448px` | From scratch | 23.55 | 4.12 | 89.16 | 90.80 | 94.98 | [config](resnet101-csra_1xb16_voc07-448px.py) | [model](https://download.openmmlab.com/mmclassification/v0/csra/resnet101-csra_1xb16_voc07-448px_20220722-29efb40a.pth) \| [log](https://download.openmmlab.com/mmclassification/v0/csra/resnet101-csra_1xb16_voc07-448px_20220722-29efb40a.json) |
+
+## Citation
+
+```bibtex
+@misc{https://doi.org/10.48550/arxiv.2108.02456,
+ doi = {10.48550/ARXIV.2108.02456},
+ url = {https://arxiv.org/abs/2108.02456},
+ author = {Zhu, Ke and Wu, Jianxin},
+ keywords = {Computer Vision and Pattern Recognition (cs.CV), FOS: Computer and information sciences, FOS: Computer and information sciences},
+ title = {Residual Attention: A Simple but Effective Method for Multi-Label Recognition},
+ publisher = {arXiv},
+ year = {2021},
+ copyright = {arXiv.org perpetual, non-exclusive license}
+}
+```
diff --git a/configs/csra/metafile.yml b/configs/csra/metafile.yml
new file mode 100644
index 0000000000000000000000000000000000000000..112f50c9d44e1bc12359653f89920b93eae67361
--- /dev/null
+++ b/configs/csra/metafile.yml
@@ -0,0 +1,29 @@
+Collections:
+ - Name: CSRA
+ Metadata:
+ Training Data: PASCAL VOC 2007
+ Architecture:
+ - Class-specific Residual Attention
+ Paper:
+ URL: https://arxiv.org/abs/2108.02456
+ Title: 'Residual Attention: A Simple but Effective Method for Multi-Label Recognition'
+ README: configs/csra/README.md
+ Code:
+ Version: v0.24.0
+ URL: https://github.com/open-mmlab/mmpretrain/blob/v0.24.0/mmcls/models/heads/multi_label_csra_head.py
+
+Models:
+ - Name: resnet101-csra_1xb16_voc07-448px
+ Metadata:
+ FLOPs: 4120000000
+ Parameters: 23550000
+ In Collection: CSRA
+ Results:
+ - Dataset: PASCAL VOC 2007
+ Metrics:
+ mAP: 94.98
+ OF1: 90.80
+ CF1: 89.16
+ Task: Multi-Label Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/csra/resnet101-csra_1xb16_voc07-448px_20220722-29efb40a.pth
+ Config: configs/csra/resnet101-csra_1xb16_voc07-448px.py
diff --git a/configs/csra/resnet101-csra_1xb16_voc07-448px.py b/configs/csra/resnet101-csra_1xb16_voc07-448px.py
new file mode 100644
index 0000000000000000000000000000000000000000..85135ae215c072accb4038b1a3fb4b3b796a6072
--- /dev/null
+++ b/configs/csra/resnet101-csra_1xb16_voc07-448px.py
@@ -0,0 +1,78 @@
+_base_ = ['../_base_/datasets/voc_bs16.py', '../_base_/default_runtime.py']
+
+# Pre-trained Checkpoint Path
+checkpoint = 'https://download.openmmlab.com/mmclassification/v0/resnet/resnet101_8xb32_in1k_20210831-539c63f8.pth' # noqa
+# If you want to use the pre-trained ResNet101-CutMix weight from the original
+# repo (https://github.com/Kevinz-code/CSRA), the script
+# 'tools/model_converters/torchvision_to_mmpretrain.py' can help you convert
+# the weight into MMPreTrain format. With that weight, the mAP reaches 95.5.
+# checkpoint = 'PATH/TO/PRE-TRAINED_WEIGHT'
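+# A hypothetical invocation of the converter script (the exact CLI arguments
+# are assumptions; check the script's --help for the real interface):
+#   python tools/model_converters/torchvision_to_mmpretrain.py \
+#       downloaded/resnet101_cutmix.pth converted/resnet101_cutmix_mmpretrain.pth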
+
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='ResNet',
+ depth=101,
+ num_stages=4,
+ out_indices=(3, ),
+ style='pytorch',
+ init_cfg=dict(
+ type='Pretrained', checkpoint=checkpoint, prefix='backbone')),
+ neck=None,
+ head=dict(
+ type='CSRAClsHead',
+ num_classes=20,
+ in_channels=2048,
+ num_heads=1,
+ lam=0.1,
+ loss=dict(type='CrossEntropyLoss', use_sigmoid=True, loss_weight=1.0)))
+
+# dataset setting
+data_preprocessor = dict(
+ # RGB format normalization parameters
+ mean=[0, 0, 0],
+ std=[255, 255, 255])
+
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='RandomResizedCrop', scale=448, crop_ratio_range=(0.7, 1.0)),
+ dict(type='RandomFlip', prob=0.5, direction='horizontal'),
+ dict(type='PackInputs'),
+]
+
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='Resize', scale=448),
+ dict(
+ type='PackInputs',
+ # `gt_label_difficult` is needed for VOC evaluation
+ meta_keys=('sample_idx', 'img_path', 'ori_shape', 'img_shape',
+ 'scale_factor', 'flip', 'flip_direction',
+ 'gt_label_difficult')),
+]
+
+train_dataloader = dict(dataset=dict(pipeline=train_pipeline))
+val_dataloader = dict(dataset=dict(pipeline=test_pipeline))
+test_dataloader = val_dataloader
+
+# optimizer
+# the lr of classifier.head is 10 * base_lr, which helps convergence.
+optim_wrapper = dict(
+ optimizer=dict(type='SGD', lr=0.0002, momentum=0.9, weight_decay=0.0001),
+ paramwise_cfg=dict(custom_keys={'head': dict(lr_mult=10)}))
+
+param_scheduler = [
+ dict(
+ type='LinearLR',
+ start_factor=1e-7,
+ by_epoch=True,
+ begin=0,
+ end=1,
+ convert_to_iter_based=True),
+ dict(type='StepLR', by_epoch=True, step_size=6, gamma=0.1)
+]
+
+train_cfg = dict(by_epoch=True, max_epochs=20, val_interval=1)
+val_cfg = dict()
+test_cfg = dict()
diff --git a/configs/davit/README.md b/configs/davit/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..1be19d98e37d4bf75dcc3d89ce689d09512b0505
--- /dev/null
+++ b/configs/davit/README.md
@@ -0,0 +1,77 @@
+# DaViT
+
+> [DaViT: Dual Attention Vision Transformers](https://arxiv.org/abs/2204.03645v1)
+
+
+
+## Abstract
+
+In this work, we introduce Dual Attention Vision Transformers (DaViT), a simple yet effective vision transformer architecture that is able to capture global context while maintaining computational efficiency. We propose approaching the problem from an orthogonal angle: exploiting self-attention mechanisms with both "spatial tokens" and "channel tokens". With spatial tokens, the spatial dimension defines the token scope, and the channel dimension defines the token feature dimension. With channel tokens, we have the inverse: the channel dimension defines the token scope, and the spatial dimension defines the token feature dimension. We further group tokens along the sequence direction for both spatial and channel tokens to maintain the linear complexity of the entire model. We show that these two self-attentions complement each other: (i) since each channel token contains an abstract representation of the entire image, the channel attention naturally captures global interactions and representations by taking all spatial positions into account when computing attention scores between channels; (ii) the spatial attention refines the local representations by performing fine-grained interactions across spatial locations, which in turn helps the global information modeling in channel attention. Extensive experiments show our DaViT achieves state-of-the-art performance on four different tasks with efficient computations. Without extra data, DaViT-Tiny, DaViT-Small, and DaViT-Base achieve 82.8%, 84.2%, and 84.6% top-1 accuracy on ImageNet-1K with 28.3M, 49.7M, and 87.9M parameters, respectively. When we further scale up DaViT with 1.5B weakly supervised image and text pairs, DaViT-Gaint reaches 90.4% top-1 accuracy on ImageNet-1K.
+
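+A minimal conceptual sketch of the two attention types described above (spatial
+tokens vs. channel tokens); it ignores DaViT's window/group partitioning and
+multi-head projections, so it is an illustration rather than the MMPreTrain
+implementation:
+
+```python
+import torch
+
+
+def spatial_self_attention(x):
+    # x: (B, N, C); tokens are spatial positions, features live along channels.
+    attn = torch.softmax(x @ x.transpose(-2, -1) / x.shape[-1] ** 0.5, dim=-1)
+    return attn @ x  # (B, N, C)
+
+
+def channel_self_attention(x):
+    # Transpose so that channels act as tokens: each channel token summarizes
+    # the whole image, which makes this attention inherently global.
+    xt = x.transpose(-2, -1)  # (B, C, N)
+    attn = torch.softmax(xt @ xt.transpose(-2, -1) / xt.shape[-1] ** 0.5, dim=-1)
+    return (attn @ xt).transpose(-2, -1)  # back to (B, N, C)
+
+
+tokens = torch.rand(1, 196, 96)
+print(spatial_self_attention(tokens).shape, channel_self_attention(tokens).shape)
+```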
+
+
+
+
+## How to use it?
+
+
+
+**Predict image**
+
+```python
+from mmpretrain import inference_model
+
+predict = inference_model('davit-tiny_3rdparty_in1k', 'demo/bird.JPEG')
+print(predict['pred_class'])
+print(predict['pred_score'])
+```
+
+**Use the model**
+
+```python
+import torch
+from mmpretrain import get_model
+
+model = get_model('davit-tiny_3rdparty_in1k', pretrained=True)
+inputs = torch.rand(1, 3, 224, 224)
+out = model(inputs)
+print(type(out))
+# To extract features.
+feats = model.extract_feat(inputs)
+print(type(feats))
+```
+
+**Test Command**
+
+Prepare your dataset according to the [docs](https://mmpretrain.readthedocs.io/en/latest/user_guides/dataset_prepare.html#prepare-dataset).
+
+Test:
+
+```shell
+python tools/test.py configs/davit/davit-tiny_4xb256_in1k.py https://download.openmmlab.com/mmclassification/v0/davit/davit-tiny_3rdparty_in1k_20221116-700fdf7d.pth
+```
+
+
+
+## Models and results
+
+### Image Classification on ImageNet-1k
+
+| Model | Pretrain | Params (M) | Flops (G) | Top-1 (%) | Top-5 (%) | Config | Download |
+| :---------------------------- | :----------: | :--------: | :-------: | :-------: | :-------: | :----------------------------------: | :------------------------------------------------------------------------------------: |
+| `davit-tiny_3rdparty_in1k`\* | From scratch | 28.36 | 4.54 | 82.24 | 96.13 | [config](davit-tiny_4xb256_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/davit/davit-tiny_3rdparty_in1k_20221116-700fdf7d.pth) |
+| `davit-small_3rdparty_in1k`\* | From scratch | 49.75 | 8.80 | 83.61 | 96.75 | [config](davit-small_4xb256_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/davit/davit-small_3rdparty_in1k_20221116-51a849a6.pth) |
+| `davit-base_3rdparty_in1k`\* | From scratch | 87.95 | 15.51 | 84.09 | 96.82 | [config](davit-base_4xb256_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/davit/davit-base_3rdparty_in1k_20221116-19e0d956.pth) |
+
+*Models with * are converted from the [official repo](https://github.com/dingmyu/davit/blob/main/mmdet/mmdet/models/backbones/davit.py#L355). The config files of these models are only for inference. We haven't reproduced the training results.*
+
+## Citation
+
+```bibtex
+@inproceedings{ding2022davit,
+ title={DaViT: Dual Attention Vision Transformer},
+ author={Ding, Mingyu and Xiao, Bin and Codella, Noel and Luo, Ping and Wang, Jingdong and Yuan, Lu},
+ booktitle={ECCV},
+ year={2022},
+}
+```
diff --git a/configs/davit/davit-base_4xb256_in1k.py b/configs/davit/davit-base_4xb256_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..071702fa7b69a3d893d9999ecf9ace28afbe193d
--- /dev/null
+++ b/configs/davit/davit-base_4xb256_in1k.py
@@ -0,0 +1,9 @@
+_base_ = [
+ '../_base_/models/davit/davit-base.py',
+ '../_base_/datasets/imagenet_bs256_davit_224.py',
+ '../_base_/schedules/imagenet_bs1024_adamw_swin.py',
+ '../_base_/default_runtime.py'
+]
+
+# data settings
+train_dataloader = dict(batch_size=256)
diff --git a/configs/davit/davit-small_4xb256_in1k.py b/configs/davit/davit-small_4xb256_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..e341031016c53b57adb477093f89b4524c6db4c1
--- /dev/null
+++ b/configs/davit/davit-small_4xb256_in1k.py
@@ -0,0 +1,9 @@
+_base_ = [
+ '../_base_/models/davit/davit-small.py',
+ '../_base_/datasets/imagenet_bs256_davit_224.py',
+ '../_base_/schedules/imagenet_bs1024_adamw_swin.py',
+ '../_base_/default_runtime.py'
+]
+
+# data settings
+train_dataloader = dict(batch_size=256)
diff --git a/configs/davit/davit-tiny_4xb256_in1k.py b/configs/davit/davit-tiny_4xb256_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..a16d87f4630b73fd4d76b52bbe926cb75dbb1d03
--- /dev/null
+++ b/configs/davit/davit-tiny_4xb256_in1k.py
@@ -0,0 +1,9 @@
+_base_ = [
+ '../_base_/models/davit/davit-tiny.py',
+ '../_base_/datasets/imagenet_bs256_davit_224.py',
+ '../_base_/schedules/imagenet_bs1024_adamw_swin.py',
+ '../_base_/default_runtime.py'
+]
+
+# data settings
+train_dataloader = dict(batch_size=256)
diff --git a/configs/davit/metafile.yml b/configs/davit/metafile.yml
new file mode 100644
index 0000000000000000000000000000000000000000..588c18fd6dade71ff114a724a42a68a1a38b72bc
--- /dev/null
+++ b/configs/davit/metafile.yml
@@ -0,0 +1,71 @@
+Collections:
+ - Name: DaViT
+ Metadata:
+ Architecture:
+ - GELU
+ - Layer Normalization
+ - Multi-Head Attention
+ - Scaled Dot-Product Attention
+ Paper:
+ URL: https://arxiv.org/abs/2204.03645v1
+ Title: 'DaViT: Dual Attention Vision Transformers'
+ README: configs/davit/README.md
+ Code:
+ URL: https://github.com/open-mmlab/mmpretrain/blob/v1.0.0rc3/mmcls/models/backbones/davit.py
+ Version: v1.0.0rc3
+
+Models:
+ - Name: davit-tiny_3rdparty_in1k
+ In Collection: DaViT
+ Metadata:
+ FLOPs: 4539698688
+ Parameters: 28360168
+ Training Data:
+ - ImageNet-1k
+ Results:
+ - Dataset: ImageNet-1k
+ Task: Image Classification
+ Metrics:
+ Top 1 Accuracy: 82.24
+ Top 5 Accuracy: 96.13
+ Weights: https://download.openmmlab.com/mmclassification/v0/davit/davit-tiny_3rdparty_in1k_20221116-700fdf7d.pth
+ Converted From:
+ Weights: https://drive.google.com/file/d/1RSpi3lxKaloOL5-or20HuG975tbPwxRZ/view?usp=sharing
+ Code: https://github.com/dingmyu/davit/blob/main/mmdet/mmdet/models/backbones/davit.py#L355
+ Config: configs/davit/davit-tiny_4xb256_in1k.py
+ - Name: davit-small_3rdparty_in1k
+ In Collection: DaViT
+ Metadata:
+ FLOPs: 8799942144
+ Parameters: 49745896
+ Training Data:
+ - ImageNet-1k
+ Results:
+ - Dataset: ImageNet-1k
+ Task: Image Classification
+ Metrics:
+ Top 1 Accuracy: 83.61
+ Top 5 Accuracy: 96.75
+ Weights: https://download.openmmlab.com/mmclassification/v0/davit/davit-small_3rdparty_in1k_20221116-51a849a6.pth
+ Converted From:
+ Weights: https://drive.google.com/file/d/1q976ruj45mt0RhO9oxhOo6EP_cmj4ahQ/view?usp=sharing
+ Code: https://github.com/dingmyu/davit/blob/main/mmdet/mmdet/models/backbones/davit.py#L355
+ Config: configs/davit/davit-small_4xb256_in1k.py
+ - Name: davit-base_3rdparty_in1k
+ In Collection: DaViT
+ Metadata:
+ FLOPs: 15509702656
+ Parameters: 87954408
+ Training Data:
+ - ImageNet-1k
+ Results:
+ - Dataset: ImageNet-1k
+ Task: Image Classification
+ Metrics:
+ Top 1 Accuracy: 84.09
+ Top 5 Accuracy: 96.82
+ Weights: https://download.openmmlab.com/mmclassification/v0/davit/davit-base_3rdparty_in1k_20221116-19e0d956.pth
+ Converted From:
+ Weights: https://drive.google.com/file/d/1u9sDBEueB-YFuLigvcwf4b2YyA4MIVsZ/view?usp=sharing
+ Code: https://github.com/dingmyu/davit/blob/main/mmdet/mmdet/models/backbones/davit.py#L355
+ Config: configs/davit/davit-base_4xb256_in1k.py
diff --git a/configs/deit/README.md b/configs/deit/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..ee434140a4316fed147c171ea425b6deff2aead6
--- /dev/null
+++ b/configs/deit/README.md
@@ -0,0 +1,97 @@
+# DeiT
+
+> [Training data-efficient image transformers & distillation through attention](https://arxiv.org/abs/2012.12877)
+
+
+
+## Abstract
+
+Recently, neural networks purely based on attention were shown to address image understanding tasks such as image classification. However, these visual transformers are pre-trained with hundreds of millions of images using an expensive infrastructure, thereby limiting their adoption. In this work, we produce a competitive convolution-free transformer by training on Imagenet only. We train them on a single computer in less than 3 days. Our reference vision transformer (86M parameters) achieves top-1 accuracy of 83.1% (single-crop evaluation) on ImageNet with no external data. More importantly, we introduce a teacher-student strategy specific to transformers. It relies on a distillation token ensuring that the student learns from the teacher through attention. We show the interest of this token-based distillation, especially when using a convnet as a teacher. This leads us to report results competitive with convnets for both Imagenet (where we obtain up to 85.2% accuracy) and when transferring to other tasks. We share our code and models.
+
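+The token-based distillation mentioned above can be summarized with a short
+sketch (a hedged illustration of the paper's hard-distillation objective, not a
+training recipe provided by these configs; the equal 0.5/0.5 weighting follows
+the hard-label variant):
+
+```python
+import torch
+import torch.nn.functional as F
+
+
+def hard_distillation_loss(cls_logits, dist_logits, labels, teacher_logits):
+    # The class token is supervised by the ground-truth labels, while the
+    # distillation token is supervised by the teacher's hard predictions.
+    loss_cls = F.cross_entropy(cls_logits, labels)
+    loss_dist = F.cross_entropy(dist_logits, teacher_logits.argmax(dim=-1))
+    return 0.5 * loss_cls + 0.5 * loss_dist
+
+
+loss = hard_distillation_loss(
+    torch.randn(4, 1000), torch.randn(4, 1000),
+    torch.randint(0, 1000, (4,)), torch.randn(4, 1000))
+print(loss.item())
+```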
+
+
+
+
+## How to use it?
+
+
+
+**Predict image**
+
+```python
+from mmpretrain import inference_model
+
+predict = inference_model('deit-tiny_4xb256_in1k', 'demo/bird.JPEG')
+print(predict['pred_class'])
+print(predict['pred_score'])
+```
+
+**Use the model**
+
+```python
+import torch
+from mmpretrain import get_model
+
+model = get_model('deit-tiny_4xb256_in1k', pretrained=True)
+inputs = torch.rand(1, 3, 224, 224)
+out = model(inputs)
+print(type(out))
+# To extract features.
+feats = model.extract_feat(inputs)
+print(type(feats))
+```
+
+**Train/Test Command**
+
+Prepare your dataset according to the [docs](https://mmpretrain.readthedocs.io/en/latest/user_guides/dataset_prepare.html#prepare-dataset).
+
+Train:
+
+```shell
+python tools/train.py configs/deit/deit-tiny_4xb256_in1k.py
+```
+
+Test:
+
+```shell
+python tools/test.py configs/deit/deit-tiny_4xb256_in1k.py https://download.openmmlab.com/mmclassification/v0/deit/deit-tiny_pt-4xb256_in1k_20220218-13b382a0.pth
+```
+
+
+
+## Models and results
+
+### Image Classification on ImageNet-1k
+
+| Model | Pretrain | Params (M) | Flops (G) | Top-1 (%) | Top-5 (%) | Config | Download |
+| :------------------------------------------------ | :----------: | :--------: | :-------: | :-------: | :-------: | :------------------------------------------------: | :--------------------------------------------------: |
+| `deit-tiny_4xb256_in1k` | From scratch | 5.72 | 1.26 | 74.50 | 92.24 | [config](deit-tiny_4xb256_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/deit/deit-tiny_pt-4xb256_in1k_20220218-13b382a0.pth) \| [log](https://download.openmmlab.com/mmclassification/v0/deit/deit-tiny_pt-4xb256_in1k_20220218-13b382a0.json) |
+| `deit-tiny-distilled_3rdparty_in1k`\* | From scratch | 5.91 | 1.27 | 74.51 | 91.90 | [config](deit-tiny-distilled_4xb256_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/deit/deit-tiny-distilled_3rdparty_pt-4xb256_in1k_20211216-c429839a.pth) |
+| `deit-small_4xb256_in1k` | From scratch | 22.05 | 4.61 | 80.69 | 95.06 | [config](deit-small_4xb256_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/deit/deit-small_pt-4xb256_in1k_20220218-9425b9bb.pth) \| [log](https://download.openmmlab.com/mmclassification/v0/deit/deit-small_pt-4xb256_in1k_20220218-9425b9bb.json) |
+| `deit-small-distilled_3rdparty_in1k`\* | From scratch | 22.44 | 4.63 | 81.17 | 95.40 | [config](deit-small-distilled_4xb256_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/deit/deit-small-distilled_3rdparty_pt-4xb256_in1k_20211216-4de1d725.pth) |
+| `deit-base_16xb64_in1k` | From scratch | 86.57 | 17.58 | 81.76 | 95.81 | [config](deit-base_16xb64_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/deit/deit-base_pt-16xb64_in1k_20220216-db63c16c.pth) \| [log](https://download.openmmlab.com/mmclassification/v0/deit/deit-base_pt-16xb64_in1k_20220216-db63c16c.json) |
+| `deit-base_3rdparty_in1k`\* | From scratch | 86.57 | 17.58 | 81.79 | 95.59 | [config](deit-base_16xb64_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/deit/deit-base_3rdparty_pt-16xb64_in1k_20211124-6f40c188.pth) |
+| `deit-base-distilled_3rdparty_in1k`\* | From scratch | 87.34 | 17.67 | 83.33 | 96.49 | [config](deit-base-distilled_16xb64_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/deit/deit-base-distilled_3rdparty_pt-16xb64_in1k_20211216-42891296.pth) |
+| `deit-base_224px-pre_3rdparty_in1k-384px`\* | 224px | 86.86 | 55.54 | 83.04 | 96.31 | [config](deit-base_16xb32_in1k-384px.py) | [model](https://download.openmmlab.com/mmclassification/v0/deit/deit-base_3rdparty_ft-16xb32_in1k-384px_20211124-822d02f2.pth) |
+| `deit-base-distilled_224px-pre_3rdparty_in1k-384px`\* | 224px | 87.63 | 55.65 | 85.55 | 97.35 | [config](deit-base-distilled_16xb32_in1k-384px.py) | [model](https://download.openmmlab.com/mmclassification/v0/deit/deit-base-distilled_3rdparty_ft-16xb32_in1k-384px_20211216-e48d6000.pth) |
+
+*Models with * are converted from the [official repo](https://github.com/facebookresearch/deit/blob/f5123946205daf72a88783dae94cabff98c49c55/models.py#L168). The config files of these models are only for inference. We haven't reproduced the training results.*
+
+```{warning}
+MMPreTrain doesn't support training the distilled versions of DeiT.
+The distilled checkpoints are provided for inference only.
+```
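+Distilled checkpoints can still be loaded for inference through the same
+high-level API; for example (a usage sketch assuming the model name listed in
+the table above):
+
+```python
+import torch
+from mmpretrain import get_model
+
+model = get_model('deit-base-distilled_3rdparty_in1k', pretrained=True)
+out = model(torch.rand(1, 3, 224, 224))
+print(type(out))
+```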
+
+## Citation
+
+```bibtex
+@InProceedings{pmlr-v139-touvron21a,
+ title = {Training data-efficient image transformers & distillation through attention},
+ author = {Touvron, Hugo and Cord, Matthieu and Douze, Matthijs and Massa, Francisco and Sablayrolles, Alexandre and Jegou, Herve},
+ booktitle = {International Conference on Machine Learning},
+ pages = {10347--10357},
+ year = {2021},
+ volume = {139},
+ month = {July}
+}
+```
diff --git a/configs/deit/deit-base-distilled_16xb32_in1k-384px.py b/configs/deit/deit-base-distilled_16xb32_in1k-384px.py
new file mode 100644
index 0000000000000000000000000000000000000000..60d3112fd530917d2196a24c25d8d0223731c52d
--- /dev/null
+++ b/configs/deit/deit-base-distilled_16xb32_in1k-384px.py
@@ -0,0 +1,37 @@
+_base_ = [
+ '../_base_/datasets/imagenet_bs64_swin_384.py',
+ '../_base_/schedules/imagenet_bs4096_AdamW.py',
+ '../_base_/default_runtime.py'
+]
+
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='DistilledVisionTransformer',
+ arch='deit-base',
+ img_size=384,
+ patch_size=16,
+ ),
+ neck=None,
+ head=dict(
+ type='DeiTClsHead',
+ num_classes=1000,
+ in_channels=768,
+ loss=dict(
+ type='LabelSmoothLoss', label_smooth_val=0.1, mode='original'),
+ ),
+ # Change to the path of the pretrained model
+ # init_cfg=dict(type='Pretrained', checkpoint=''),
+)
+
+# dataset settings
+train_dataloader = dict(batch_size=32)
+
+# schedule settings
+optim_wrapper = dict(clip_grad=dict(max_norm=1.0))
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR
+# based on the actual training batch size.
+# base_batch_size = (16 GPUs) x (32 samples per GPU)
+auto_scale_lr = dict(base_batch_size=512)
diff --git a/configs/deit/deit-base-distilled_16xb64_in1k.py b/configs/deit/deit-base-distilled_16xb64_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..207bf250f62f3317df6535cf9b7e8dd0b4a1f5ac
--- /dev/null
+++ b/configs/deit/deit-base-distilled_16xb64_in1k.py
@@ -0,0 +1,46 @@
+_base_ = [
+ '../_base_/datasets/imagenet_bs64_swin_224.py',
+ '../_base_/schedules/imagenet_bs1024_adamw_swin.py',
+ '../_base_/default_runtime.py'
+]
+
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='DistilledVisionTransformer',
+ arch='deit-base',
+ img_size=224,
+ patch_size=16),
+ neck=None,
+ head=dict(
+ type='DeiTClsHead',
+ num_classes=1000,
+ in_channels=768,
+ loss=dict(
+ type='LabelSmoothLoss', label_smooth_val=0.1, mode='original'),
+ ),
+ init_cfg=[
+ dict(type='TruncNormal', layer='Linear', std=.02),
+ dict(type='Constant', layer='LayerNorm', val=1., bias=0.),
+ ],
+ train_cfg=dict(augments=[
+ dict(type='Mixup', alpha=0.8),
+ dict(type='CutMix', alpha=1.0)
+ ]),
+)
+
+# dataset settings
+train_dataloader = dict(batch_size=64)
+
+# schedule settings
+optim_wrapper = dict(
+ paramwise_cfg=dict(
+ norm_decay_mult=0.0,
+ bias_decay_mult=0.0,
+ custom_keys={
+ '.cls_token': dict(decay_mult=0.0),
+ '.pos_embed': dict(decay_mult=0.0)
+ }),
+ clip_grad=dict(max_norm=5.0),
+)
diff --git a/configs/deit/deit-base_16xb32_in1k-384px.py b/configs/deit/deit-base_16xb32_in1k-384px.py
new file mode 100644
index 0000000000000000000000000000000000000000..762b4604348d1e8f0940f0243c9c824215d4b207
--- /dev/null
+++ b/configs/deit/deit-base_16xb32_in1k-384px.py
@@ -0,0 +1,37 @@
+_base_ = [
+ '../_base_/datasets/imagenet_bs64_swin_384.py',
+ '../_base_/schedules/imagenet_bs4096_AdamW.py',
+ '../_base_/default_runtime.py'
+]
+
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='VisionTransformer',
+ arch='deit-base',
+ img_size=384,
+ patch_size=16,
+ ),
+ neck=None,
+ head=dict(
+ type='VisionTransformerClsHead',
+ num_classes=1000,
+ in_channels=768,
+ loss=dict(
+ type='LabelSmoothLoss', label_smooth_val=0.1, mode='original'),
+ ),
+ # Change to the path of the pretrained model
+ # init_cfg=dict(type='Pretrained', checkpoint=''),
+)
+
+# dataset settings
+train_dataloader = dict(batch_size=32)
+
+# schedule settings
+optim_wrapper = dict(clip_grad=dict(max_norm=1.0))
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR
+# based on the actual training batch size.
+# base_batch_size = (16 GPUs) x (32 samples per GPU)
+auto_scale_lr = dict(base_batch_size=512)
diff --git a/configs/deit/deit-base_16xb64_in1k.py b/configs/deit/deit-base_16xb64_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..66f03a99f20a10649a954c15b2aa9c44374704fe
--- /dev/null
+++ b/configs/deit/deit-base_16xb64_in1k.py
@@ -0,0 +1,50 @@
+_base_ = [
+ '../_base_/datasets/imagenet_bs64_swin_224.py',
+ '../_base_/schedules/imagenet_bs1024_adamw_swin.py',
+ '../_base_/default_runtime.py'
+]
+
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='VisionTransformer',
+ arch='deit-base',
+ img_size=224,
+ patch_size=16,
+ drop_path_rate=0.1),
+ neck=None,
+ head=dict(
+ type='VisionTransformerClsHead',
+ num_classes=1000,
+ in_channels=768,
+ loss=dict(
+ type='LabelSmoothLoss', label_smooth_val=0.1, mode='original'),
+ ),
+ init_cfg=[
+ dict(type='TruncNormal', layer='Linear', std=.02),
+ dict(type='Constant', layer='LayerNorm', val=1., bias=0.),
+ ],
+ train_cfg=dict(augments=[
+ dict(type='Mixup', alpha=0.8),
+ dict(type='CutMix', alpha=1.0)
+ ]),
+)
+
+# dataset settings
+train_dataloader = dict(batch_size=64)
+
+# schedule settings
+optim_wrapper = dict(
+ paramwise_cfg=dict(
+ norm_decay_mult=0.0,
+ bias_decay_mult=0.0,
+ custom_keys={
+ '.cls_token': dict(decay_mult=0.0),
+ '.pos_embed': dict(decay_mult=0.0)
+ }),
+ clip_grad=dict(max_norm=5.0),
+)
+
+# runtime settings
+custom_hooks = [dict(type='EMAHook', momentum=4e-5, priority='ABOVE_NORMAL')]
diff --git a/configs/deit/deit-small-distilled_4xb256_in1k.py b/configs/deit/deit-small-distilled_4xb256_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..9c7c58cb3d76e8b36f766080e4ec7de056a0621b
--- /dev/null
+++ b/configs/deit/deit-small-distilled_4xb256_in1k.py
@@ -0,0 +1,46 @@
+_base_ = [
+ '../_base_/datasets/imagenet_bs64_swin_224.py',
+ '../_base_/schedules/imagenet_bs1024_adamw_swin.py',
+ '../_base_/default_runtime.py'
+]
+
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='DistilledVisionTransformer',
+ arch='deit-small',
+ img_size=224,
+ patch_size=16),
+ neck=None,
+ head=dict(
+ type='DeiTClsHead',
+ num_classes=1000,
+ in_channels=384,
+ loss=dict(
+ type='LabelSmoothLoss', label_smooth_val=0.1, mode='original'),
+ ),
+ init_cfg=[
+ dict(type='TruncNormal', layer='Linear', std=.02),
+ dict(type='Constant', layer='LayerNorm', val=1., bias=0.),
+ ],
+ train_cfg=dict(augments=[
+ dict(type='Mixup', alpha=0.8),
+ dict(type='CutMix', alpha=1.0)
+ ]),
+)
+
+# data settings
+train_dataloader = dict(batch_size=256)
+
+# schedule settings
+optim_wrapper = dict(
+ paramwise_cfg=dict(
+ norm_decay_mult=0.0,
+ bias_decay_mult=0.0,
+ custom_keys={
+ '.cls_token': dict(decay_mult=0.0),
+ '.pos_embed': dict(decay_mult=0.0)
+ }),
+ clip_grad=dict(max_norm=5.0),
+)
diff --git a/configs/deit/deit-small_4xb256_in1k.py b/configs/deit/deit-small_4xb256_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..b96d84ec46bf2badd08b69fddaa2d8b8109b1ebf
--- /dev/null
+++ b/configs/deit/deit-small_4xb256_in1k.py
@@ -0,0 +1,48 @@
+# Compared with the original config, the small and tiny archs remove drop path
+# and the EMA hook.
+_base_ = [
+ '../_base_/datasets/imagenet_bs64_swin_224.py',
+ '../_base_/schedules/imagenet_bs1024_adamw_swin.py',
+ '../_base_/default_runtime.py'
+]
+
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='VisionTransformer',
+ arch='deit-small',
+ img_size=224,
+ patch_size=16),
+ neck=None,
+ head=dict(
+ type='VisionTransformerClsHead',
+ num_classes=1000,
+ in_channels=384,
+ loss=dict(
+ type='LabelSmoothLoss', label_smooth_val=0.1, mode='original'),
+ ),
+ init_cfg=[
+ dict(type='TruncNormal', layer='Linear', std=.02),
+ dict(type='Constant', layer='LayerNorm', val=1., bias=0.),
+ ],
+ train_cfg=dict(augments=[
+ dict(type='Mixup', alpha=0.8),
+ dict(type='CutMix', alpha=1.0)
+ ]),
+)
+
+# data settings
+train_dataloader = dict(batch_size=256)
+
+# schedule settings
+optim_wrapper = dict(
+ paramwise_cfg=dict(
+ norm_decay_mult=0.0,
+ bias_decay_mult=0.0,
+ custom_keys={
+ '.cls_token': dict(decay_mult=0.0),
+ '.pos_embed': dict(decay_mult=0.0)
+ }),
+ clip_grad=dict(max_norm=5.0),
+)
diff --git a/configs/deit/deit-tiny-distilled_4xb256_in1k.py b/configs/deit/deit-tiny-distilled_4xb256_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..00a9c4bd214a7c3d3eb1163b73aeb70251ce1bbc
--- /dev/null
+++ b/configs/deit/deit-tiny-distilled_4xb256_in1k.py
@@ -0,0 +1,47 @@
+# The distillation config is only for evaluation.
+_base_ = [
+ '../_base_/datasets/imagenet_bs64_swin_224.py',
+ '../_base_/schedules/imagenet_bs1024_adamw_swin.py',
+ '../_base_/default_runtime.py'
+]
+
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='DistilledVisionTransformer',
+ arch='deit-tiny',
+ img_size=224,
+ patch_size=16),
+ neck=None,
+ head=dict(
+ type='DeiTClsHead',
+ num_classes=1000,
+ in_channels=192,
+ loss=dict(
+ type='LabelSmoothLoss', label_smooth_val=0.1, mode='original'),
+ ),
+ init_cfg=[
+ dict(type='TruncNormal', layer='Linear', std=.02),
+ dict(type='Constant', layer='LayerNorm', val=1., bias=0.),
+ ],
+ train_cfg=dict(augments=[
+ dict(type='Mixup', alpha=0.8),
+ dict(type='CutMix', alpha=1.0)
+ ]),
+)
+
+# data settings
+train_dataloader = dict(batch_size=256)
+
+# schedule settings
+optim_wrapper = dict(
+ paramwise_cfg=dict(
+ norm_decay_mult=0.0,
+ bias_decay_mult=0.0,
+ custom_keys={
+ '.cls_token': dict(decay_mult=0.0),
+ '.pos_embed': dict(decay_mult=0.0)
+ }),
+ clip_grad=dict(max_norm=5.0),
+)
diff --git a/configs/deit/deit-tiny_4xb256_in1k.py b/configs/deit/deit-tiny_4xb256_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..486669e9c16e01ccc3d469c55bb04e714225b624
--- /dev/null
+++ b/configs/deit/deit-tiny_4xb256_in1k.py
@@ -0,0 +1,48 @@
+# Compared with the original config, the small and tiny archs remove drop path
+# and the EMA hook.
+_base_ = [
+ '../_base_/datasets/imagenet_bs64_swin_224.py',
+ '../_base_/schedules/imagenet_bs1024_adamw_swin.py',
+ '../_base_/default_runtime.py'
+]
+
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='VisionTransformer',
+ arch='deit-tiny',
+ img_size=224,
+ patch_size=16),
+ neck=None,
+ head=dict(
+ type='VisionTransformerClsHead',
+ num_classes=1000,
+ in_channels=192,
+ loss=dict(
+ type='LabelSmoothLoss', label_smooth_val=0.1, mode='original'),
+ ),
+ init_cfg=[
+ dict(type='TruncNormal', layer='Linear', std=.02),
+ dict(type='Constant', layer='LayerNorm', val=1., bias=0.),
+ ],
+ train_cfg=dict(augments=[
+ dict(type='Mixup', alpha=0.8),
+ dict(type='CutMix', alpha=1.0)
+ ]),
+)
+
+# data settings
+train_dataloader = dict(batch_size=256)
+
+# schedule settings
+optim_wrapper = dict(
+ paramwise_cfg=dict(
+ norm_decay_mult=0.0,
+ bias_decay_mult=0.0,
+ custom_keys={
+ '.cls_token': dict(decay_mult=0.0),
+ '.pos_embed': dict(decay_mult=0.0)
+ }),
+ clip_grad=dict(max_norm=5.0),
+)
diff --git a/configs/deit/metafile.yml b/configs/deit/metafile.yml
new file mode 100644
index 0000000000000000000000000000000000000000..f6f0c5e56a4f72fc7df812705b9d2ec4a6a589bb
--- /dev/null
+++ b/configs/deit/metafile.yml
@@ -0,0 +1,153 @@
+Collections:
+ - Name: DeiT
+ Metadata:
+ Training Data: ImageNet-1k
+ Architecture:
+ - Layer Normalization
+ - Scaled Dot-Product Attention
+ - Attention Dropout
+ - Multi-Head Attention
+ Paper:
+ Title: Training data-efficient image transformers & distillation through attention
+ URL: https://arxiv.org/abs/2012.12877
+ README: configs/deit/README.md
+ Code:
+ URL: https://github.com/open-mmlab/mmpretrain/blob/v0.19.0/mmcls/models/backbones/deit.py
+ Version: v0.19.0
+
+Models:
+ - Name: deit-tiny_4xb256_in1k
+ Metadata:
+ FLOPs: 1258219200
+ Parameters: 5717416
+ In Collection: DeiT
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 74.5
+ Top 5 Accuracy: 92.24
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/deit/deit-tiny_pt-4xb256_in1k_20220218-13b382a0.pth
+ Config: configs/deit/deit-tiny_4xb256_in1k.py
+ - Name: deit-tiny-distilled_3rdparty_in1k
+ Metadata:
+ FLOPs: 1265371776
+ Parameters: 5910800
+ In Collection: DeiT
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 74.51
+ Top 5 Accuracy: 91.9
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/deit/deit-tiny-distilled_3rdparty_pt-4xb256_in1k_20211216-c429839a.pth
+ Config: configs/deit/deit-tiny-distilled_4xb256_in1k.py
+ Converted From:
+ Weights: https://dl.fbaipublicfiles.com/deit/deit_tiny_distilled_patch16_224-b40b3cf7.pth
+ Code: https://github.com/facebookresearch/deit/blob/f5123946205daf72a88783dae94cabff98c49c55/models.py#L108
+ - Name: deit-small_4xb256_in1k
+ Metadata:
+ FLOPs: 4607954304
+ Parameters: 22050664
+ In Collection: DeiT
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 80.69
+ Top 5 Accuracy: 95.06
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/deit/deit-small_pt-4xb256_in1k_20220218-9425b9bb.pth
+ Config: configs/deit/deit-small_4xb256_in1k.py
+ - Name: deit-small-distilled_3rdparty_in1k
+ Metadata:
+ FLOPs: 4632876288
+ Parameters: 22436432
+ In Collection: DeiT
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 81.17
+ Top 5 Accuracy: 95.4
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/deit/deit-small-distilled_3rdparty_pt-4xb256_in1k_20211216-4de1d725.pth
+ Config: configs/deit/deit-small-distilled_4xb256_in1k.py
+ Converted From:
+ Weights: https://dl.fbaipublicfiles.com/deit/deit_small_distilled_patch16_224-649709d9.pth
+ Code: https://github.com/facebookresearch/deit/blob/f5123946205daf72a88783dae94cabff98c49c55/models.py#L123
+ - Name: deit-base_16xb64_in1k
+ Metadata:
+ FLOPs: 17581972224
+ Parameters: 86567656
+ In Collection: DeiT
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 81.76
+ Top 5 Accuracy: 95.81
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/deit/deit-base_pt-16xb64_in1k_20220216-db63c16c.pth
+ Config: configs/deit/deit-base_16xb64_in1k.py
+ - Name: deit-base_3rdparty_in1k
+ Metadata:
+ FLOPs: 17581972224
+ Parameters: 86567656
+ In Collection: DeiT
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 81.79
+ Top 5 Accuracy: 95.59
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/deit/deit-base_3rdparty_pt-16xb64_in1k_20211124-6f40c188.pth
+ Config: configs/deit/deit-base_16xb64_in1k.py
+ Converted From:
+ Weights: https://dl.fbaipublicfiles.com/deit/deit_base_patch16_224-b5f2ef4d.pth
+ Code: https://github.com/facebookresearch/deit/blob/f5123946205daf72a88783dae94cabff98c49c55/models.py#L93
+ - Name: deit-base-distilled_3rdparty_in1k
+ Metadata:
+ FLOPs: 17674283520
+ Parameters: 87338192
+ In Collection: DeiT
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 83.33
+ Top 5 Accuracy: 96.49
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/deit/deit-base-distilled_3rdparty_pt-16xb64_in1k_20211216-42891296.pth
+ Config: configs/deit/deit-base-distilled_16xb64_in1k.py
+ Converted From:
+ Weights: https://dl.fbaipublicfiles.com/deit/deit_base_distilled_patch16_224-df68dfff.pth
+ Code: https://github.com/facebookresearch/deit/blob/f5123946205daf72a88783dae94cabff98c49c55/models.py#L138
+ - Name: deit-base_224px-pre_3rdparty_in1k-384px
+ Metadata:
+ FLOPs: 55538974464
+ Parameters: 86859496
+ In Collection: DeiT
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 83.04
+ Top 5 Accuracy: 96.31
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/deit/deit-base_3rdparty_ft-16xb32_in1k-384px_20211124-822d02f2.pth
+ Config: configs/deit/deit-base_16xb32_in1k-384px.py
+ Converted From:
+ Weights: https://dl.fbaipublicfiles.com/deit/deit_base_patch16_384-8de9b5d1.pth
+ Code: https://github.com/facebookresearch/deit/blob/f5123946205daf72a88783dae94cabff98c49c55/models.py#L153
+ - Name: deit-base-distilled_224px-pre_3rdparty_in1k-384px
+ Metadata:
+ FLOPs: 55645294080
+ Parameters: 87630032
+ In Collection: DeiT
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 85.55
+ Top 5 Accuracy: 97.35
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/deit/deit-base-distilled_3rdparty_ft-16xb32_in1k-384px_20211216-e48d6000.pth
+ Config: configs/deit/deit-base-distilled_16xb32_in1k-384px.py
+ Converted From:
+ Weights: https://dl.fbaipublicfiles.com/deit/deit_base_distilled_patch16_384-d0272ac0.pth
+ Code: https://github.com/facebookresearch/deit/blob/f5123946205daf72a88783dae94cabff98c49c55/models.py#L168
diff --git a/configs/deit3/README.md b/configs/deit3/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..18694b7eb9b97589aece3c9bfc7187b9c9d83841
--- /dev/null
+++ b/configs/deit3/README.md
@@ -0,0 +1,90 @@
+# DeiT III: Revenge of the ViT
+
+> [DeiT III: Revenge of the ViT](https://arxiv.org/abs/2204.07118)
+
+
+
+## Abstract
+
+A Vision Transformer (ViT) is a simple neural architecture amenable to serve several computer vision tasks. It has limited built-in architectural priors, in contrast to more recent architectures that incorporate priors either about the input data or of specific tasks. Recent works show that ViTs benefit from self-supervised pre-training, in particular BerT-like pre-training like BeiT. In this paper, we revisit the supervised training of ViTs. Our procedure builds upon and simplifies a recipe introduced for training ResNet-50. It includes a new simple data-augmentation procedure with only 3 augmentations, closer to the practice in self-supervised learning. Our evaluations on Image classification (ImageNet-1k with and without pre-training on ImageNet-21k), transfer learning and semantic segmentation show that our procedure outperforms by a large margin previous fully supervised training recipes for ViT. It also reveals that the performance of our ViT trained with supervision is comparable to that of more recent architectures. Our results could serve as better baselines for recent self-supervised approaches demonstrated on ViT.
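+
+Below is a minimal, hypothetical sketch of the "3-Augment" recipe mentioned in the abstract, written with `torchvision` transforms for illustration only; the actual MMPreTrain pipeline is defined in the `imagenet_bs64_deit3_*.py` base configs, and the crop size, magnitudes and probabilities used here are assumptions.
+
+```python
+import torchvision.transforms as T
+
+# Pick exactly one of {grayscale, solarization, blur} per image, plus simple
+# color jitter and horizontal flip, as described in the DeiT III abstract.
+three_augment = T.Compose([
+    T.RandomResizedCrop(224),
+    T.RandomHorizontalFlip(),
+    T.RandomChoice([
+        T.Grayscale(num_output_channels=3),
+        T.RandomSolarize(threshold=128, p=1.0),
+        T.GaussianBlur(kernel_size=9),
+    ]),
+    T.ColorJitter(0.3, 0.3, 0.3),
+    T.ToTensor(),
+])
+```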
+
+
+
+
+
+## How to use it?
+
+
+
+**Predict image**
+
+```python
+from mmpretrain import inference_model
+
+predict = inference_model('deit3-small-p16_3rdparty_in1k', 'demo/bird.JPEG')
+print(predict['pred_class'])
+print(predict['pred_score'])
+```
+
+**Use the model**
+
+```python
+import torch
+from mmpretrain import get_model
+
+model = get_model('deit3-small-p16_3rdparty_in1k', pretrained=True)
+inputs = torch.rand(1, 3, 224, 224)
+out = model(inputs)
+print(type(out))
+# To extract features.
+feats = model.extract_feat(inputs)
+print(type(feats))
+```
+
+**Test Command**
+
+Prepare your dataset according to the [docs](https://mmpretrain.readthedocs.io/en/latest/user_guides/dataset_prepare.html#prepare-dataset).
+
+Test:
+
+```shell
+python tools/test.py configs/deit3/deit3-small-p16_64xb64_in1k.py https://download.openmmlab.com/mmclassification/v0/deit3/deit3-small-p16_3rdparty_in1k_20221008-0f7c70cf.pth
+```
+
+
+
+## Models and results
+
+### Image Classification on ImageNet-1k
+
+| Model | Pretrain | Params (M) | Flops (G) | Top-1 (%) | Top-5 (%) | Config | Download |
+| :------------------------------------------------ | :----------: | :--------: | :-------: | :-------: | :-------: | :--------------------------------------------: | :------------------------------------------------------: |
+| `deit3-small-p16_3rdparty_in1k`\* | From scratch | 22.06 | 4.61 | 81.35 | 95.31 | [config](deit3-small-p16_64xb64_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/deit3/deit3-small-p16_3rdparty_in1k_20221008-0f7c70cf.pth) |
+| `deit3-small-p16_3rdparty_in1k-384px`\* | From scratch | 22.21 | 15.52 | 83.43 | 96.68 | [config](deit3-small-p16_64xb64_in1k-384px.py) | [model](https://download.openmmlab.com/mmclassification/v0/deit3/deit3-small-p16_3rdparty_in1k-384px_20221008-a2c1a0c7.pth) |
+| `deit3-small-p16_in21k-pre_3rdparty_in1k`\* | ImageNet-21k | 22.06 | 4.61 | 83.06 | 96.77 | [config](deit3-small-p16_64xb64_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/deit3/deit3-small-p16_in21k-pre_3rdparty_in1k_20221009-dcd90827.pth) |
+| `deit3-small-p16_in21k-pre_3rdparty_in1k-384px`\* | ImageNet-21k | 22.21 | 15.52 | 84.84 | 97.48 | [config](deit3-small-p16_64xb64_in1k-384px.py) | [model](https://download.openmmlab.com/mmclassification/v0/deit3/deit3-small-p16_in21k-pre_3rdparty_in1k-384px_20221009-de116dd7.pth) |
+| `deit3-medium-p16_3rdparty_in1k`\* | From scratch | 38.85 | 8.00 | 82.99 | 96.22 | [config](deit3-medium-p16_64xb64_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/deit3/deit3-medium-p16_3rdparty_in1k_20221008-3b21284d.pth) |
+| `deit3-medium-p16_in21k-pre_3rdparty_in1k`\* | ImageNet-21k | 38.85 | 8.00 | 84.56 | 97.19 | [config](deit3-medium-p16_64xb64_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/deit3/deit3-medium-p16_in21k-pre_3rdparty_in1k_20221009-472f11e2.pth) |
+| `deit3-base-p16_3rdparty_in1k`\* | From scratch | 86.59 | 17.58 | 83.80 | 96.55 | [config](deit3-base-p16_64xb64_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/deit3/deit3-base-p16_3rdparty_in1k_20221008-60b8c8bf.pth) |
+| `deit3-base-p16_3rdparty_in1k-384px`\* | From scratch | 86.88 | 55.54 | 85.08 | 97.25 | [config](deit3-base-p16_64xb32_in1k-384px.py) | [model](https://download.openmmlab.com/mmclassification/v0/deit3/deit3-base-p16_3rdparty_in1k-384px_20221009-e19e36d4.pth) |
+| `deit3-base-p16_in21k-pre_3rdparty_in1k`\* | ImageNet-21k | 86.59 | 17.58 | 85.70 | 97.75 | [config](deit3-base-p16_64xb64_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/deit3/deit3-base-p16_in21k-pre_3rdparty_in1k_20221009-87983ca1.pth) |
+| `deit3-base-p16_in21k-pre_3rdparty_in1k-384px`\* | ImageNet-21k | 86.88 | 55.54 | 86.73 | 98.11 | [config](deit3-base-p16_64xb32_in1k-384px.py) | [model](https://download.openmmlab.com/mmclassification/v0/deit3/deit3-base-p16_in21k-pre_3rdparty_in1k-384px_20221009-5e4e37b9.pth) |
+| `deit3-large-p16_3rdparty_in1k`\* | From scratch | 304.37 | 61.60 | 84.87 | 97.01 | [config](deit3-large-p16_64xb64_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/deit3/deit3-large-p16_3rdparty_in1k_20221009-03b427ea.pth) |
+| `deit3-large-p16_3rdparty_in1k-384px`\* | From scratch | 304.76 | 191.21 | 85.82 | 97.60 | [config](deit3-large-p16_64xb16_in1k-384px.py) | [model](https://download.openmmlab.com/mmclassification/v0/deit3/deit3-large-p16_3rdparty_in1k-384px_20221009-4317ce62.pth) |
+| `deit3-large-p16_in21k-pre_3rdparty_in1k`\* | ImageNet-21k | 304.37 | 61.60 | 86.97 | 98.24 | [config](deit3-large-p16_64xb64_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/deit3/deit3-large-p16_in21k-pre_3rdparty_in1k_20221009-d8d27084.pth) |
+| `deit3-large-p16_in21k-pre_3rdparty_in1k-384px`\* | ImageNet-21k | 304.76 | 191.21 | 87.73 | 98.51 | [config](deit3-large-p16_64xb16_in1k-384px.py) | [model](https://download.openmmlab.com/mmclassification/v0/deit3/deit3-large-p16_in21k-pre_3rdparty_in1k-384px_20221009-75fea03f.pth) |
+| `deit3-huge-p14_3rdparty_in1k`\* | From scratch | 632.13 | 167.40 | 85.21 | 97.36 | [config](deit3-huge-p14_64xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/deit3/deit3-huge-p14_3rdparty_in1k_20221009-e107bcb7.pth) |
+| `deit3-huge-p14_in21k-pre_3rdparty_in1k`\* | ImageNet-21k | 632.13 | 167.40 | 87.19 | 98.26 | [config](deit3-huge-p14_64xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/deit3/deit3-huge-p14_in21k-pre_3rdparty_in1k_20221009-19b8a535.pth) |
+
+*Models with * are converted from the [official repo](https://github.com/facebookresearch/deit/blob/main/models_v2.py#L171). The config files of these models are only for inference. We haven't reproduced the training results.*
+
+## Citation
+
+```bibtex
+@article{Touvron2022DeiTIR,
+ title={DeiT III: Revenge of the ViT},
+ author={Hugo Touvron and Matthieu Cord and Herve Jegou},
+ journal={arXiv preprint arXiv:2204.07118},
+ year={2022},
+}
+```
diff --git a/configs/deit3/deit3-base-p16_64xb32_in1k-384px.py b/configs/deit3/deit3-base-p16_64xb32_in1k-384px.py
new file mode 100644
index 0000000000000000000000000000000000000000..b6c8a8c411ee96a88bc44c042cdf134a36eb05da
--- /dev/null
+++ b/configs/deit3/deit3-base-p16_64xb32_in1k-384px.py
@@ -0,0 +1,17 @@
+_base_ = [
+ '../_base_/models/deit3/deit3-base-p16-384.py',
+ '../_base_/datasets/imagenet_bs64_deit3_384.py',
+ '../_base_/schedules/imagenet_bs4096_AdamW.py',
+ '../_base_/default_runtime.py'
+]
+
+# dataset setting
+train_dataloader = dict(batch_size=32)
+
+# schedule settings
+optim_wrapper = dict(optimizer=dict(lr=1e-5, weight_decay=0.1))
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR
+# based on the actual training batch size.
+# base_batch_size = (64 GPUs) x (32 samples per GPU)
+auto_scale_lr = dict(base_batch_size=2048)
diff --git a/configs/deit3/deit3-base-p16_64xb64_in1k.py b/configs/deit3/deit3-base-p16_64xb64_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..c69a64cdd06da1e868bb08e9eec5cbf9b82f5aa9
--- /dev/null
+++ b/configs/deit3/deit3-base-p16_64xb64_in1k.py
@@ -0,0 +1,17 @@
+_base_ = [
+ '../_base_/models/deit3/deit3-base-p16-224.py',
+ '../_base_/datasets/imagenet_bs64_deit3_224.py',
+ '../_base_/schedules/imagenet_bs4096_AdamW.py',
+ '../_base_/default_runtime.py'
+]
+
+# dataset setting
+train_dataloader = dict(batch_size=64)
+
+# schedule settings
+optim_wrapper = dict(optimizer=dict(lr=1e-5, weight_decay=0.1))
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR
+# based on the actual training batch size.
+# base_batch_size = (64 GPUs) x (64 samples per GPU)
+auto_scale_lr = dict(base_batch_size=4096)
diff --git a/configs/deit3/deit3-huge-p14_64xb32_in1k.py b/configs/deit3/deit3-huge-p14_64xb32_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..f8cae075b6a28f8519390983621b2dc98173e507
--- /dev/null
+++ b/configs/deit3/deit3-huge-p14_64xb32_in1k.py
@@ -0,0 +1,17 @@
+_base_ = [
+ '../_base_/models/deit3/deit3-huge-p14-224.py',
+ '../_base_/datasets/imagenet_bs64_deit3_224.py',
+ '../_base_/schedules/imagenet_bs4096_AdamW.py',
+ '../_base_/default_runtime.py'
+]
+
+# dataset setting
+train_dataloader = dict(batch_size=32)
+
+# schedule settings
+optim_wrapper = dict(optimizer=dict(lr=1e-5, weight_decay=0.1))
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR
+# based on the actual training batch size.
+# base_batch_size = (64 GPUs) x (32 samples per GPU)
+auto_scale_lr = dict(base_batch_size=2048)
diff --git a/configs/deit3/deit3-large-p16_64xb16_in1k-384px.py b/configs/deit3/deit3-large-p16_64xb16_in1k-384px.py
new file mode 100644
index 0000000000000000000000000000000000000000..84fb0feae636a3f3c4b2297ed6935e817701cbea
--- /dev/null
+++ b/configs/deit3/deit3-large-p16_64xb16_in1k-384px.py
@@ -0,0 +1,17 @@
+_base_ = [
+ '../_base_/models/deit3/deit3-large-p16-384.py',
+ '../_base_/datasets/imagenet_bs64_deit3_384.py',
+ '../_base_/schedules/imagenet_bs4096_AdamW.py',
+ '../_base_/default_runtime.py'
+]
+
+# dataset setting
+train_dataloader = dict(batch_size=16)
+
+# schedule settings
+optim_wrapper = dict(optimizer=dict(lr=1e-5, weight_decay=0.1))
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR
+# based on the actual training batch size.
+# base_batch_size = (64 GPUs) x (16 samples per GPU)
+auto_scale_lr = dict(base_batch_size=1024)
diff --git a/configs/deit3/deit3-large-p16_64xb64_in1k.py b/configs/deit3/deit3-large-p16_64xb64_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..a67ac21f9ba3fefdb7e22429e565fb6ee6eeff86
--- /dev/null
+++ b/configs/deit3/deit3-large-p16_64xb64_in1k.py
@@ -0,0 +1,17 @@
+_base_ = [
+ '../_base_/models/deit3/deit3-large-p16-224.py',
+ '../_base_/datasets/imagenet_bs64_deit3_224.py',
+ '../_base_/schedules/imagenet_bs4096_AdamW.py',
+ '../_base_/default_runtime.py'
+]
+
+# dataset setting
+train_dataloader = dict(batch_size=64)
+
+# schedule settings
+optim_wrapper = dict(optimizer=dict(lr=1e-5, weight_decay=0.1))
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR
+# based on the actual training batch size.
+# base_batch_size = (64 GPUs) x (64 samples per GPU)
+auto_scale_lr = dict(base_batch_size=4096)
diff --git a/configs/deit3/deit3-medium-p16_64xb64_in1k.py b/configs/deit3/deit3-medium-p16_64xb64_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..def48e682a5fa66e166f4419b8e1850e26f75d17
--- /dev/null
+++ b/configs/deit3/deit3-medium-p16_64xb64_in1k.py
@@ -0,0 +1,17 @@
+_base_ = [
+ '../_base_/models/deit3/deit3-medium-p16-224.py',
+ '../_base_/datasets/imagenet_bs64_deit3_224.py',
+ '../_base_/schedules/imagenet_bs4096_AdamW.py',
+ '../_base_/default_runtime.py'
+]
+
+# dataset setting
+train_dataloader = dict(batch_size=64)
+
+# schedule settings
+optim_wrapper = dict(optimizer=dict(lr=1e-5, weight_decay=0.1))
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR
+# based on the actual training batch size.
+# base_batch_size = (64 GPUs) x (64 samples per GPU)
+auto_scale_lr = dict(base_batch_size=4096)
diff --git a/configs/deit3/deit3-small-p16_64xb64_in1k-384px.py b/configs/deit3/deit3-small-p16_64xb64_in1k-384px.py
new file mode 100644
index 0000000000000000000000000000000000000000..e6b3e892c34268d2bdfeb9f7ab7f1808ea203558
--- /dev/null
+++ b/configs/deit3/deit3-small-p16_64xb64_in1k-384px.py
@@ -0,0 +1,17 @@
+_base_ = [
+ '../_base_/models/deit3/deit3-small-p16-384.py',
+ '../_base_/datasets/imagenet_bs64_deit3_384.py',
+ '../_base_/schedules/imagenet_bs4096_AdamW.py',
+ '../_base_/default_runtime.py'
+]
+
+# dataset setting
+train_dataloader = dict(batch_size=64)
+
+# schedule settings
+optim_wrapper = dict(optimizer=dict(lr=1e-5, weight_decay=0.1))
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR
+# based on the actual training batch size.
+# base_batch_size = (64 GPUs) x (64 samples per GPU)
+auto_scale_lr = dict(base_batch_size=4096)
diff --git a/configs/deit3/deit3-small-p16_64xb64_in1k.py b/configs/deit3/deit3-small-p16_64xb64_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..58b0a2f1837e09edc3c43d6776fda169e4b0480b
--- /dev/null
+++ b/configs/deit3/deit3-small-p16_64xb64_in1k.py
@@ -0,0 +1,17 @@
+_base_ = [
+ '../_base_/models/deit3/deit3-small-p16-224.py',
+ '../_base_/datasets/imagenet_bs64_deit3_224.py',
+ '../_base_/schedules/imagenet_bs4096_AdamW.py',
+ '../_base_/default_runtime.py'
+]
+
+# dataset setting
+train_dataloader = dict(batch_size=64)
+
+# schedule settings
+optim_wrapper = dict(optimizer=dict(lr=1e-5, weight_decay=0.1))
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR
+# based on the actual training batch size.
+# base_batch_size = (64 GPUs) x (64 samples per GPU)
+auto_scale_lr = dict(base_batch_size=4096)
diff --git a/configs/deit3/metafile.yml b/configs/deit3/metafile.yml
new file mode 100644
index 0000000000000000000000000000000000000000..6f50fdc396c017fcbf3d2542f6fe52c0ed5bf546
--- /dev/null
+++ b/configs/deit3/metafile.yml
@@ -0,0 +1,310 @@
+Collections:
+ - Name: DeiT3
+ Metadata:
+ Architecture:
+ - Attention Dropout
+ - Convolution
+ - Dense Connections
+ - Dropout
+ - GELU
+ - Layer Normalization
+ - Multi-Head Attention
+ - Scaled Dot-Product Attention
+ - Tanh Activation
+ Paper:
+ URL: https://arxiv.org/abs/2204.07118
+ Title: 'DeiT III: Revenge of the ViT'
+ README: configs/deit3/README.md
+ Code:
+ URL: https://github.com/open-mmlab/mmpretrain/blob/v1.0.0rc2/mmcls/models/backbones/deit3.py
+ Version: v1.0.0rc2
+
+Models:
+ - Name: deit3-small-p16_3rdparty_in1k
+ In Collection: DeiT3
+ Metadata:
+ FLOPs: 4607954304
+ Parameters: 22059496
+ Training Data:
+ - ImageNet-1k
+ Results:
+ - Dataset: ImageNet-1k
+ Task: Image Classification
+ Metrics:
+ Top 1 Accuracy: 81.35
+ Top 5 Accuracy: 95.31
+ Weights: https://download.openmmlab.com/mmclassification/v0/deit3/deit3-small-p16_3rdparty_in1k_20221008-0f7c70cf.pth
+ Converted From:
+ Weights: https://dl.fbaipublicfiles.com/deit/deit_3_small_224_1k.pth
+ Code: https://github.com/facebookresearch/deit/blob/main/models_v2.py#L171
+ Config: configs/deit3/deit3-small-p16_64xb64_in1k.py
+ - Name: deit3-small-p16_3rdparty_in1k-384px
+ In Collection: DeiT3
+ Metadata:
+ FLOPs: 15517663104
+ Parameters: 22205416
+ Training Data:
+ - ImageNet-1k
+ Results:
+ - Dataset: ImageNet-1k
+ Task: Image Classification
+ Metrics:
+ Top 1 Accuracy: 83.43
+ Top 5 Accuracy: 96.68
+ Weights: https://download.openmmlab.com/mmclassification/v0/deit3/deit3-small-p16_3rdparty_in1k-384px_20221008-a2c1a0c7.pth
+ Converted From:
+ Weights: https://dl.fbaipublicfiles.com/deit/deit_3_small_384_1k.pth
+ Code: https://github.com/facebookresearch/deit/blob/main/models_v2.py#L171
+ Config: configs/deit3/deit3-small-p16_64xb64_in1k-384px.py
+ - Name: deit3-small-p16_in21k-pre_3rdparty_in1k
+ In Collection: DeiT3
+ Metadata:
+ FLOPs: 4607954304
+ Parameters: 22059496
+ Training Data:
+ - ImageNet-21k
+ Results:
+ - Dataset: ImageNet-1k
+ Task: Image Classification
+ Metrics:
+ Top 1 Accuracy: 83.06
+ Top 5 Accuracy: 96.77
+ Weights: https://download.openmmlab.com/mmclassification/v0/deit3/deit3-small-p16_in21k-pre_3rdparty_in1k_20221009-dcd90827.pth
+ Converted From:
+ Weights: https://dl.fbaipublicfiles.com/deit/deit_3_small_224_21k.pth
+ Code: https://github.com/facebookresearch/deit/blob/main/models_v2.py#L171
+ Config: configs/deit3/deit3-small-p16_64xb64_in1k.py
+ - Name: deit3-small-p16_in21k-pre_3rdparty_in1k-384px
+ In Collection: DeiT3
+ Metadata:
+ FLOPs: 15517663104
+ Parameters: 22205416
+ Training Data:
+ - ImageNet-21k
+ Results:
+ - Dataset: ImageNet-1k
+ Task: Image Classification
+ Metrics:
+ Top 1 Accuracy: 84.84
+ Top 5 Accuracy: 97.48
+ Weights: https://download.openmmlab.com/mmclassification/v0/deit3/deit3-small-p16_in21k-pre_3rdparty_in1k-384px_20221009-de116dd7.pth
+ Converted From:
+ Weights: https://dl.fbaipublicfiles.com/deit/deit_3_small_384_21k.pth
+ Code: https://github.com/facebookresearch/deit/blob/main/models_v2.py#L171
+ Config: configs/deit3/deit3-small-p16_64xb64_in1k-384px.py
+ - Name: deit3-medium-p16_3rdparty_in1k
+ In Collection: DeiT3
+ Metadata:
+ FLOPs: 8003064320
+ Parameters: 38849512
+ Training Data:
+ - ImageNet-1k
+ Results:
+ - Dataset: ImageNet-1k
+ Task: Image Classification
+ Metrics:
+ Top 1 Accuracy: 82.99
+ Top 5 Accuracy: 96.22
+ Weights: https://download.openmmlab.com/mmclassification/v0/deit3/deit3-medium-p16_3rdparty_in1k_20221008-3b21284d.pth
+ Converted From:
+ Weights: https://dl.fbaipublicfiles.com/deit/deit_3_medium_224_1k.pth
+ Code: https://github.com/facebookresearch/deit/blob/main/models_v2.py#L171
+ Config: configs/deit3/deit3-medium-p16_64xb64_in1k.py
+ - Name: deit3-medium-p16_in21k-pre_3rdparty_in1k
+ In Collection: DeiT3
+ Metadata:
+ FLOPs: 8003064320
+ Parameters: 38849512
+ Training Data:
+ - ImageNet-21k
+ Results:
+ - Dataset: ImageNet-1k
+ Task: Image Classification
+ Metrics:
+ Top 1 Accuracy: 84.56
+ Top 5 Accuracy: 97.19
+ Weights: https://download.openmmlab.com/mmclassification/v0/deit3/deit3-medium-p16_in21k-pre_3rdparty_in1k_20221009-472f11e2.pth
+ Converted From:
+ Weights: https://dl.fbaipublicfiles.com/deit/deit_3_medium_224_21k.pth
+ Code: https://github.com/facebookresearch/deit/blob/main/models_v2.py#L171
+ Config: configs/deit3/deit3-medium-p16_64xb64_in1k.py
+ - Name: deit3-base-p16_3rdparty_in1k
+ In Collection: DeiT3
+ Metadata:
+ FLOPs: 17581972224
+ Parameters: 86585320
+ Training Data:
+ - ImageNet-1k
+ Results:
+ - Dataset: ImageNet-1k
+ Task: Image Classification
+ Metrics:
+ Top 1 Accuracy: 83.80
+ Top 5 Accuracy: 96.55
+ Weights: https://download.openmmlab.com/mmclassification/v0/deit3/deit3-base-p16_3rdparty_in1k_20221008-60b8c8bf.pth
+ Converted From:
+ Weights: https://dl.fbaipublicfiles.com/deit/deit_3_base_224_1k.pth
+ Code: https://github.com/facebookresearch/deit/blob/main/models_v2.py#L171
+ Config: configs/deit3/deit3-base-p16_64xb64_in1k.py
+ - Name: deit3-base-p16_3rdparty_in1k-384px
+ In Collection: DeiT3
+ Metadata:
+ FLOPs: 55538974464
+ Parameters: 86877160
+ Training Data:
+ - ImageNet-1k
+ Results:
+ - Dataset: ImageNet-1k
+ Task: Image Classification
+ Metrics:
+ Top 1 Accuracy: 85.08
+ Top 5 Accuracy: 97.25
+ Weights: https://download.openmmlab.com/mmclassification/v0/deit3/deit3-base-p16_3rdparty_in1k-384px_20221009-e19e36d4.pth
+ Converted From:
+ Weights: https://dl.fbaipublicfiles.com/deit/deit_3_base_384_1k.pth
+ Code: https://github.com/facebookresearch/deit/blob/main/models_v2.py#L171
+ Config: configs/deit3/deit3-base-p16_64xb32_in1k-384px.py
+ - Name: deit3-base-p16_in21k-pre_3rdparty_in1k
+ In Collection: DeiT3
+ Metadata:
+ FLOPs: 17581972224
+ Parameters: 86585320
+ Training Data:
+ - ImageNet-21k
+ Results:
+ - Dataset: ImageNet-1k
+ Task: Image Classification
+ Metrics:
+ Top 1 Accuracy: 85.70
+ Top 5 Accuracy: 97.75
+ Weights: https://download.openmmlab.com/mmclassification/v0/deit3/deit3-base-p16_in21k-pre_3rdparty_in1k_20221009-87983ca1.pth
+ Converted From:
+ Weights: https://dl.fbaipublicfiles.com/deit/deit_3_base_224_21k.pth
+ Code: https://github.com/facebookresearch/deit/blob/main/models_v2.py#L171
+ Config: configs/deit3/deit3-base-p16_64xb64_in1k.py
+ - Name: deit3-base-p16_in21k-pre_3rdparty_in1k-384px
+ In Collection: DeiT3
+ Metadata:
+ FLOPs: 55538974464
+ Parameters: 86877160
+ Training Data:
+ - ImageNet-21k
+ Results:
+ - Dataset: ImageNet-1k
+ Task: Image Classification
+ Metrics:
+ Top 1 Accuracy: 86.73
+ Top 5 Accuracy: 98.11
+ Weights: https://download.openmmlab.com/mmclassification/v0/deit3/deit3-base-p16_in21k-pre_3rdparty_in1k-384px_20221009-5e4e37b9.pth
+ Converted From:
+ Weights: https://dl.fbaipublicfiles.com/deit/deit_3_base_384_21k.pth
+ Code: https://github.com/facebookresearch/deit/blob/main/models_v2.py#L171
+ Config: configs/deit3/deit3-base-p16_64xb32_in1k-384px.py
+ - Name: deit3-large-p16_3rdparty_in1k
+ In Collection: DeiT3
+ Metadata:
+ FLOPs: 61603111936
+ Parameters: 304374760
+ Training Data:
+ - ImageNet-1k
+ Results:
+ - Dataset: ImageNet-1k
+ Task: Image Classification
+ Metrics:
+ Top 1 Accuracy: 84.87
+ Top 5 Accuracy: 97.01
+ Weights: https://download.openmmlab.com/mmclassification/v0/deit3/deit3-large-p16_3rdparty_in1k_20221009-03b427ea.pth
+ Converted From:
+ Weights: https://dl.fbaipublicfiles.com/deit/deit_3_large_224_1k.pth
+ Code: https://github.com/facebookresearch/deit/blob/main/models_v2.py#L171
+ Config: configs/deit3/deit3-large-p16_64xb64_in1k.py
+ - Name: deit3-large-p16_3rdparty_in1k-384px
+ In Collection: DeiT3
+ Metadata:
+ FLOPs: 191210034176
+ Parameters: 304763880
+ Training Data:
+ - ImageNet-1k
+ Results:
+ - Dataset: ImageNet-1k
+ Task: Image Classification
+ Metrics:
+ Top 1 Accuracy: 85.82
+ Top 5 Accuracy: 97.60
+ Weights: https://download.openmmlab.com/mmclassification/v0/deit3/deit3-large-p16_3rdparty_in1k-384px_20221009-4317ce62.pth
+ Converted From:
+ Weights: https://dl.fbaipublicfiles.com/deit/deit_3_large_384_1k.pth
+ Code: https://github.com/facebookresearch/deit/blob/main/models_v2.py#L171
+ Config: configs/deit3/deit3-large-p16_64xb16_in1k-384px.py
+ - Name: deit3-large-p16_in21k-pre_3rdparty_in1k
+ In Collection: DeiT3
+ Metadata:
+ FLOPs: 61603111936
+ Parameters: 304374760
+ Training Data:
+ - ImageNet-21k
+ Results:
+ - Dataset: ImageNet-1k
+ Task: Image Classification
+ Metrics:
+ Top 1 Accuracy: 86.97
+ Top 5 Accuracy: 98.24
+ Weights: https://download.openmmlab.com/mmclassification/v0/deit3/deit3-large-p16_in21k-pre_3rdparty_in1k_20221009-d8d27084.pth
+ Converted From:
+ Weights: https://dl.fbaipublicfiles.com/deit/deit_3_large_224_21k.pth
+ Code: https://github.com/facebookresearch/deit/blob/main/models_v2.py#L171
+ Config: configs/deit3/deit3-large-p16_64xb64_in1k.py
+ - Name: deit3-large-p16_in21k-pre_3rdparty_in1k-384px
+ In Collection: DeiT3
+ Metadata:
+ FLOPs: 191210034176
+ Parameters: 304763880
+ Training Data:
+ - ImageNet-21k
+ Results:
+ - Dataset: ImageNet-1k
+ Task: Image Classification
+ Metrics:
+ Top 1 Accuracy: 87.73
+ Top 5 Accuracy: 98.51
+ Weights: https://download.openmmlab.com/mmclassification/v0/deit3/deit3-large-p16_in21k-pre_3rdparty_in1k-384px_20221009-75fea03f.pth
+ Converted From:
+ Weights: https://dl.fbaipublicfiles.com/deit/deit_3_large_384_21k.pth
+ Code: https://github.com/facebookresearch/deit/blob/main/models_v2.py#L171
+ Config: configs/deit3/deit3-large-p16_64xb16_in1k-384px.py
+ - Name: deit3-huge-p14_3rdparty_in1k
+ In Collection: DeiT3
+ Metadata:
+ FLOPs: 167400741120
+ Parameters: 632126440
+ Training Data:
+ - ImageNet-1k
+ Results:
+ - Dataset: ImageNet-1k
+ Task: Image Classification
+ Metrics:
+ Top 1 Accuracy: 85.21
+ Top 5 Accuracy: 97.36
+ Weights: https://download.openmmlab.com/mmclassification/v0/deit3/deit3-huge-p14_3rdparty_in1k_20221009-e107bcb7.pth
+ Converted From:
+ Weights: https://dl.fbaipublicfiles.com/deit/deit_3_huge_224_1k.pth
+ Code: https://github.com/facebookresearch/deit/blob/main/models_v2.py#L171
+ Config: configs/deit3/deit3-huge-p14_64xb32_in1k.py
+ - Name: deit3-huge-p14_in21k-pre_3rdparty_in1k
+ In Collection: DeiT3
+ Metadata:
+ FLOPs: 167400741120
+ Parameters: 632126440
+ Training Data:
+ - ImageNet-21k
+ Results:
+ - Dataset: ImageNet-1k
+ Task: Image Classification
+ Metrics:
+ Top 1 Accuracy: 87.19
+ Top 5 Accuracy: 98.26
+ Weights: https://download.openmmlab.com/mmclassification/v0/deit3/deit3-huge-p14_in21k-pre_3rdparty_in1k_20221009-19b8a535.pth
+ Converted From:
+ Weights: https://dl.fbaipublicfiles.com/deit/deit_3_huge_224_1k.pth
+ Code: https://github.com/facebookresearch/deit/blob/main/models_v2.py#L171
+ Config: configs/deit3/deit3-huge-p14_64xb32_in1k.py
diff --git a/configs/densecl/README.md b/configs/densecl/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..d1e1295d9f6a12d47196e6d2c4663d0758076167
--- /dev/null
+++ b/configs/densecl/README.md
@@ -0,0 +1,85 @@
+# DenseCL
+
+> [Dense contrastive learning for self-supervised visual pre-training](https://arxiv.org/abs/2011.09157)
+
+
+
+## Abstract
+
+To date, most existing self-supervised learning methods are designed and optimized for image classification. These pre-trained models can be sub-optimal for dense prediction tasks due to the discrepancy between image-level prediction and pixel-level prediction. To fill this gap, we aim to design an effective, dense self-supervised learning method that directly works at the level of pixels (or local features) by taking into account the correspondence between local features. We present dense contrastive learning (DenseCL), which implements self-supervised learning by optimizing a pairwise contrastive (dis)similarity loss at the pixel level between two views of input images.
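+
+As a rough illustration of the dense (pixel-level) contrastive objective described above, here is a minimal sketch of an InfoNCE-style loss applied per spatial location, with each query-view feature matched to its most similar key-view feature. This is a simplification under stated assumptions, not the actual DenseCL implementation in MMPreTrain.
+
+```python
+import torch
+import torch.nn.functional as F
+
+def dense_info_nce(q, k, queue, temperature=0.2):
+    """q, k: (B, C, HW) dense features of two views; queue: (C, K) negatives."""
+    q = F.normalize(q, dim=1)
+    k = F.normalize(k, dim=1)
+    # Match every location in the query view to its most similar key location.
+    sim = torch.einsum('bci,bcj->bij', q, k)                  # (B, HW, HW)
+    idx = sim.argmax(dim=2)                                   # (B, HW)
+    k_matched = torch.gather(
+        k, 2, idx.unsqueeze(1).expand(-1, k.size(1), -1))     # (B, C, HW)
+    pos = (q * k_matched).sum(dim=1, keepdim=True)            # (B, 1, HW)
+    neg = torch.einsum('bci,ck->bki', q, queue)               # (B, K, HW)
+    logits = torch.cat([pos, neg], dim=1) / temperature
+    # The positive pair sits at index 0 for every spatial location.
+    labels = logits.new_zeros(logits.size(0), logits.size(2), dtype=torch.long)
+    return F.cross_entropy(logits, labels)
+```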
+
+
+
+
+
+## How to use it?
+
+
+
+**Predict image**
+
+```python
+from mmpretrain import inference_model
+
+predict = inference_model('resnet50_densecl-pre_8xb32-linear-steplr-100e_in1k', 'demo/bird.JPEG')
+print(predict['pred_class'])
+print(predict['pred_score'])
+```
+
+**Use the model**
+
+```python
+import torch
+from mmpretrain import get_model
+
+model = get_model('densecl_resnet50_8xb32-coslr-200e_in1k', pretrained=True)
+inputs = torch.rand(1, 3, 224, 224)
+out = model(inputs)
+print(type(out))
+# To extract features.
+feats = model.extract_feat(inputs)
+print(type(feats))
+```
+
+**Train/Test Command**
+
+Prepare your dataset according to the [docs](https://mmpretrain.readthedocs.io/en/latest/user_guides/dataset_prepare.html#prepare-dataset).
+
+Train:
+
+```shell
+python tools/train.py configs/densecl/densecl_resnet50_8xb32-coslr-200e_in1k.py
+```
+
+Test:
+
+```shell
+python tools/test.py configs/densecl/benchmarks/resnet50_8xb32-linear-steplr-100e_in1k.py https://download.openmmlab.com/mmselfsup/1.x/densecl/densecl_resnet50_8xb32-coslr-200e_in1k/resnet50_linear-8xb32-steplr-100e_in1k/resnet50_linear-8xb32-steplr-100e_in1k_20220825-f0f0a579.pth
+```
+
+
+
+## Models and results
+
+### Pretrained models
+
+| Model | Params (M) | Flops (G) | Config | Download |
+| :--------------------------------------- | :--------: | :-------: | :-------------------------------------------------: | :----------------------------------------------------------------------------------------: |
+| `densecl_resnet50_8xb32-coslr-200e_in1k` | 64.85 | 4.11 | [config](densecl_resnet50_8xb32-coslr-200e_in1k.py) | [model](https://download.openmmlab.com/mmselfsup/1.x/densecl/densecl_resnet50_8xb32-coslr-200e_in1k/densecl_resnet50_8xb32-coslr-200e_in1k_20220825-3078723b.pth) \| [log](https://download.openmmlab.com/mmselfsup/1.x/densecl/densecl_resnet50_8xb32-coslr-200e_in1k/densecl_resnet50_8xb32-coslr-200e_in1k_20220825-3078723b.json) |
+
+### Image Classification on ImageNet-1k
+
+| Model | Pretrain | Params (M) | Flops (G) | Top-1 (%) | Config | Download |
+| :---------------------------------------- | :------------------------------------------: | :--------: | :-------: | :-------: | :----------------------------------------: | :-------------------------------------------: |
+| `resnet50_densecl-pre_8xb32-linear-steplr-100e_in1k` | [DENSECL](https://download.openmmlab.com/mmselfsup/1.x/densecl/densecl_resnet50_8xb32-coslr-200e_in1k/densecl_resnet50_8xb32-coslr-200e_in1k_20220825-3078723b.pth) | 25.56 | 4.11 | 63.50 | [config](benchmarks/resnet50_8xb32-linear-steplr-100e_in1k.py) | [model](https://download.openmmlab.com/mmselfsup/1.x/densecl/densecl_resnet50_8xb32-coslr-200e_in1k/resnet50_linear-8xb32-steplr-100e_in1k/resnet50_linear-8xb32-steplr-100e_in1k_20220825-f0f0a579.pth) \| [log](https://download.openmmlab.com/mmselfsup/1.x/densecl/densecl_resnet50_8xb32-coslr-200e_in1k/resnet50_linear-8xb32-steplr-100e_in1k/resnet50_linear-8xb32-steplr-100e_in1k_20220825-f0f0a579.json) |
+
+## Citation
+
+```bibtex
+@inproceedings{wang2021dense,
+ title={Dense contrastive learning for self-supervised visual pre-training},
+ author={Wang, Xinlong and Zhang, Rufeng and Shen, Chunhua and Kong, Tao and Li, Lei},
+ booktitle={CVPR},
+ year={2021}
+}
+```
diff --git a/configs/densecl/benchmarks/resnet50_8xb32-linear-steplr-100e_in1k.py b/configs/densecl/benchmarks/resnet50_8xb32-linear-steplr-100e_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..37795d9c866c5f9b26b0e016959a01677b8a216e
--- /dev/null
+++ b/configs/densecl/benchmarks/resnet50_8xb32-linear-steplr-100e_in1k.py
@@ -0,0 +1,20 @@
+_base_ = [
+ '../../_base_/models/resnet50.py',
+ '../../_base_/datasets/imagenet_bs32_pil_resize.py',
+ '../../_base_/schedules/imagenet_sgd_steplr_100e.py',
+ '../../_base_/default_runtime.py',
+]
+
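+# Linear-evaluation protocol: `frozen_stages=4` below freezes the entire
+# ResNet-50 backbone, so only the linear classification head is trained on
+# top of the pre-trained DenseCL features. The empty `checkpoint=''` is a
+# placeholder to be replaced with the path/URL of a pre-trained checkpoint.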
+model = dict(
+ backbone=dict(
+ frozen_stages=4,
+ init_cfg=dict(type='Pretrained', checkpoint='', prefix='backbone.')))
+
+# optimizer
+optim_wrapper = dict(
+ type='OptimWrapper',
+ optimizer=dict(type='SGD', lr=30., momentum=0.9, weight_decay=0.))
+
+# runtime settings
+default_hooks = dict(
+ checkpoint=dict(type='CheckpointHook', interval=10, max_keep_ckpts=3))
diff --git a/configs/densecl/densecl_resnet50_8xb32-coslr-200e_in1k.py b/configs/densecl/densecl_resnet50_8xb32-coslr-200e_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..8a3959f1a91c1911e426563759795afeef71bea0
--- /dev/null
+++ b/configs/densecl/densecl_resnet50_8xb32-coslr-200e_in1k.py
@@ -0,0 +1,39 @@
+_base_ = [
+ '../_base_/datasets/imagenet_bs32_mocov2.py',
+ '../_base_/schedules/imagenet_sgd_coslr_200e.py',
+ '../_base_/default_runtime.py',
+]
+
+# model settings
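+# Key hyper-parameters below: `queue_len` is the size of the MoCo-style
+# negative queue, `momentum` controls the momentum (EMA) update of the key
+# encoder, and `loss_lambda` balances the global and dense contrastive loss
+# terms.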
+model = dict(
+ type='DenseCL',
+ queue_len=65536,
+ feat_dim=128,
+ momentum=0.001,
+ loss_lambda=0.5,
+ backbone=dict(
+ type='ResNet',
+ depth=50,
+ norm_cfg=dict(type='BN'),
+ zero_init_residual=False),
+ neck=dict(
+ type='DenseCLNeck',
+ in_channels=2048,
+ hid_channels=2048,
+ out_channels=128,
+ num_grid=None),
+ head=dict(
+ type='ContrastiveHead',
+ loss=dict(type='CrossEntropyLoss'),
+ temperature=0.2),
+)
+find_unused_parameters = True
+
+# runtime settings
+default_hooks = dict(
+ # only keeps the latest 3 checkpoints
+ checkpoint=dict(type='CheckpointHook', interval=10, max_keep_ckpts=3))
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR
+# based on the actual training batch size.
+auto_scale_lr = dict(base_batch_size=256)
diff --git a/configs/densecl/metafile.yml b/configs/densecl/metafile.yml
new file mode 100644
index 0000000000000000000000000000000000000000..24449910aaa5930cbd32ec8fae18dec62ee73505
--- /dev/null
+++ b/configs/densecl/metafile.yml
@@ -0,0 +1,44 @@
+Collections:
+ - Name: DenseCL
+ Metadata:
+ Training Data: ImageNet-1k
+ Training Techniques:
+ - SGD with Momentum
+ - Weight Decay
+ Training Resources: 8x V100 GPUs
+ Architecture:
+ - ResNet
+ Paper:
+ Title: Dense contrastive learning for self-supervised visual pre-training
+ URL: https://arxiv.org/abs/2011.09157
+ README: configs/densecl/README.md
+
+Models:
+ - Name: densecl_resnet50_8xb32-coslr-200e_in1k
+ Metadata:
+ Epochs: 200
+ Batch Size: 256
+ FLOPs: 4109364224
+ Parameters: 64850560
+ Training Data: ImageNet-1k
+ In Collection: DenseCL
+ Results: null
+ Weights: https://download.openmmlab.com/mmselfsup/1.x/densecl/densecl_resnet50_8xb32-coslr-200e_in1k/densecl_resnet50_8xb32-coslr-200e_in1k_20220825-3078723b.pth
+ Config: configs/densecl/densecl_resnet50_8xb32-coslr-200e_in1k.py
+ Downstream:
+ - resnet50_densecl-pre_8xb32-linear-steplr-100e_in1k
+ - Name: resnet50_densecl-pre_8xb32-linear-steplr-100e_in1k
+ Metadata:
+ Epochs: 100
+ Batch Size: 256
+ FLOPs: 4109464576
+ Parameters: 25557032
+ Training Data: ImageNet-1k
+ In Collection: DenseCL
+ Results:
+ - Task: Image Classification
+ Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 63.5
+ Weights: https://download.openmmlab.com/mmselfsup/1.x/densecl/densecl_resnet50_8xb32-coslr-200e_in1k/resnet50_linear-8xb32-steplr-100e_in1k/resnet50_linear-8xb32-steplr-100e_in1k_20220825-f0f0a579.pth
+ Config: configs/densecl/benchmarks/resnet50_8xb32-linear-steplr-100e_in1k.py
diff --git a/configs/densenet/README.md b/configs/densenet/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..fe40fdd99cf069d76b4937e96ae252c5122ba953
--- /dev/null
+++ b/configs/densenet/README.md
@@ -0,0 +1,82 @@
+# DenseNet
+
+> [Densely Connected Convolutional Networks](https://arxiv.org/abs/1608.06993)
+
+
+
+## Abstract
+
+Recent work has shown that convolutional networks can be substantially deeper, more accurate, and efficient to train if they contain shorter connections between layers close to the input and those close to the output. In this paper, we embrace this observation and introduce the Dense Convolutional Network (DenseNet), which connects each layer to every layer in a feed-forward fashion. Whereas traditional convolutional networks with L layers have L connections - one between each layer and its subsequent layer - our network has L(L+1)/2 direct connections. For each layer, the feature-maps of all preceding layers are used as inputs, and its own feature-maps are used as inputs into all subsequent layers. DenseNets have several compelling advantages: they alleviate the vanishing-gradient problem, strengthen feature propagation, encourage feature reuse, and substantially reduce the number of parameters. We evaluate our proposed architecture on four highly competitive object recognition benchmark tasks (CIFAR-10, CIFAR-100, SVHN, and ImageNet). DenseNets obtain significant improvements over the state-of-the-art on most of them, whilst requiring less computation to achieve high performance.
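+
+To make the connectivity pattern above concrete, here is a minimal, hypothetical dense-block sketch in PyTorch using plain BN-ReLU-Conv layers; the real DenseNet additionally uses 1x1 bottleneck convolutions and transition layers between blocks.
+
+```python
+import torch
+import torch.nn as nn
+
+class TinyDenseBlock(nn.Module):
+    """Each layer receives the concatenation of all preceding feature maps."""
+
+    def __init__(self, in_channels, growth_rate, num_layers):
+        super().__init__()
+        self.layers = nn.ModuleList()
+        for i in range(num_layers):
+            channels = in_channels + i * growth_rate
+            self.layers.append(nn.Sequential(
+                nn.BatchNorm2d(channels),
+                nn.ReLU(inplace=True),
+                nn.Conv2d(channels, growth_rate, kernel_size=3, padding=1),
+            ))
+
+    def forward(self, x):
+        features = [x]
+        for layer in self.layers:
+            # L layers give L(L+1)/2 direct connections in total.
+            features.append(layer(torch.cat(features, dim=1)))
+        return torch.cat(features, dim=1)
+```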
+
+
+
+
+
+## How to use it?
+
+
+
+**Predict image**
+
+```python
+from mmpretrain import inference_model
+
+predict = inference_model('densenet121_3rdparty_in1k', 'demo/bird.JPEG')
+print(predict['pred_class'])
+print(predict['pred_score'])
+```
+
+**Use the model**
+
+```python
+import torch
+from mmpretrain import get_model
+
+model = get_model('densenet121_3rdparty_in1k', pretrained=True)
+inputs = torch.rand(1, 3, 224, 224)
+out = model(inputs)
+print(type(out))
+# To extract features.
+feats = model.extract_feat(inputs)
+print(type(feats))
+```
+
+**Test Command**
+
+Prepare your dataset according to the [docs](https://mmpretrain.readthedocs.io/en/latest/user_guides/dataset_prepare.html#prepare-dataset).
+
+Test:
+
+```shell
+python tools/test.py configs/densenet/densenet121_4xb256_in1k.py https://download.openmmlab.com/mmclassification/v0/densenet/densenet121_4xb256_in1k_20220426-07450f99.pth
+```
+
+
+
+## Models and results
+
+### Image Classification on ImageNet-1k
+
+| Model | Pretrain | Params (M) | Flops (G) | Top-1 (%) | Top-5 (%) | Config | Download |
+| :---------------------------- | :----------: | :--------: | :-------: | :-------: | :-------: | :----------------------------------: | :------------------------------------------------------------------------------------: |
+| `densenet121_3rdparty_in1k`\* | From scratch | 7.98 | 2.88 | 74.96 | 92.21 | [config](densenet121_4xb256_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/densenet/densenet121_4xb256_in1k_20220426-07450f99.pth) |
+| `densenet169_3rdparty_in1k`\* | From scratch | 14.15 | 3.42 | 76.08 | 93.11 | [config](densenet169_4xb256_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/densenet/densenet169_4xb256_in1k_20220426-a2889902.pth) |
+| `densenet201_3rdparty_in1k`\* | From scratch | 20.01 | 4.37 | 77.32 | 93.64 | [config](densenet201_4xb256_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/densenet/densenet201_4xb256_in1k_20220426-05cae4ef.pth) |
+| `densenet161_3rdparty_in1k`\* | From scratch | 28.68 | 7.82 | 77.61 | 93.83 | [config](densenet161_4xb256_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/densenet/densenet161_4xb256_in1k_20220426-ee6a80a9.pth) |
+
+*Models with * are converted from the [official repo](https://github.com/pytorch/vision/blob/main/torchvision/models/densenet.py). The config files of these models are only for inference. We haven't reproduced the training results.*
+
+## Citation
+
+```bibtex
+@misc{https://doi.org/10.48550/arxiv.1608.06993,
+ doi = {10.48550/ARXIV.1608.06993},
+ url = {https://arxiv.org/abs/1608.06993},
+ author = {Huang, Gao and Liu, Zhuang and van der Maaten, Laurens and Weinberger, Kilian Q.},
+ keywords = {Computer Vision and Pattern Recognition (cs.CV), Machine Learning (cs.LG), FOS: Computer and information sciences, FOS: Computer and information sciences},
+ title = {Densely Connected Convolutional Networks},
+ publisher = {arXiv},
+ year = {2016},
+ copyright = {arXiv.org perpetual, non-exclusive license}
+}
+```
diff --git a/configs/densenet/densenet121_4xb256_in1k.py b/configs/densenet/densenet121_4xb256_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..dc9854f5b44da27bcf4a5a4d5faefca625dc85b0
--- /dev/null
+++ b/configs/densenet/densenet121_4xb256_in1k.py
@@ -0,0 +1,17 @@
+_base_ = [
+ '../_base_/models/densenet/densenet121.py',
+ '../_base_/datasets/imagenet_bs64.py',
+ '../_base_/schedules/imagenet_bs256.py',
+ '../_base_/default_runtime.py',
+]
+
+# dataset settings
+train_dataloader = dict(batch_size=256)
+
+# schedule settings
+train_cfg = dict(by_epoch=True, max_epochs=90)
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR
+# based on the actual training batch size.
+# base_batch_size = (4 GPUs) x (256 samples per GPU)
+auto_scale_lr = dict(base_batch_size=1024)
diff --git a/configs/densenet/densenet161_4xb256_in1k.py b/configs/densenet/densenet161_4xb256_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..a28a278bfc8132f4099afc576c43b05fd4095fd0
--- /dev/null
+++ b/configs/densenet/densenet161_4xb256_in1k.py
@@ -0,0 +1,17 @@
+_base_ = [
+ '../_base_/models/densenet/densenet161.py',
+ '../_base_/datasets/imagenet_bs64.py',
+ '../_base_/schedules/imagenet_bs256.py',
+ '../_base_/default_runtime.py',
+]
+
+# dataset settings
+train_dataloader = dict(batch_size=256)
+
+# schedule settings
+train_cfg = dict(by_epoch=True, max_epochs=90)
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR
+# based on the actual training batch size.
+# base_batch_size = (4 GPUs) x (256 samples per GPU)
+auto_scale_lr = dict(base_batch_size=1024)
diff --git a/configs/densenet/densenet169_4xb256_in1k.py b/configs/densenet/densenet169_4xb256_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..73469da115d23da250d790d68a36f55fb8eccfff
--- /dev/null
+++ b/configs/densenet/densenet169_4xb256_in1k.py
@@ -0,0 +1,17 @@
+_base_ = [
+ '../_base_/models/densenet/densenet169.py',
+ '../_base_/datasets/imagenet_bs64.py',
+ '../_base_/schedules/imagenet_bs256.py',
+ '../_base_/default_runtime.py',
+]
+
+# dataset settings
+train_dataloader = dict(batch_size=256)
+
+# schedule settings
+train_cfg = dict(by_epoch=True, max_epochs=90)
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR
+# based on the actual training batch size.
+# base_batch_size = (4 GPUs) x (256 samples per GPU)
+auto_scale_lr = dict(base_batch_size=1024)
diff --git a/configs/densenet/densenet201_4xb256_in1k.py b/configs/densenet/densenet201_4xb256_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..4a9b7b1923351fc1f47ad1aa0e4470316e076e96
--- /dev/null
+++ b/configs/densenet/densenet201_4xb256_in1k.py
@@ -0,0 +1,17 @@
+_base_ = [
+ '../_base_/models/densenet/densenet201.py',
+ '../_base_/datasets/imagenet_bs64.py',
+ '../_base_/schedules/imagenet_bs256.py',
+ '../_base_/default_runtime.py',
+]
+
+# dataset settings
+train_dataloader = dict(batch_size=256)
+
+# schedule settings
+train_cfg = dict(by_epoch=True, max_epochs=90)
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR
+# based on the actual training batch size.
+# base_batch_size = (4 GPUs) x (256 samples per GPU)
+auto_scale_lr = dict(base_batch_size=1024)
diff --git a/configs/densenet/metafile.yml b/configs/densenet/metafile.yml
new file mode 100644
index 0000000000000000000000000000000000000000..40575acb6b4314d8ebc5c9317e9e032e0a8b0cba
--- /dev/null
+++ b/configs/densenet/metafile.yml
@@ -0,0 +1,76 @@
+Collections:
+ - Name: DenseNet
+ Metadata:
+ Training Data: ImageNet-1k
+ Architecture:
+ - DenseBlock
+ Paper:
+ URL: https://arxiv.org/abs/1608.06993
+ Title: Densely Connected Convolutional Networks
+ README: configs/densenet/README.md
+
+Models:
+ - Name: densenet121_3rdparty_in1k
+ Metadata:
+ FLOPs: 2881695488
+ Parameters: 7978856
+ In Collection: DenseNet
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 74.96
+ Top 5 Accuracy: 92.21
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/densenet/densenet121_4xb256_in1k_20220426-07450f99.pth
+ Config: configs/densenet/densenet121_4xb256_in1k.py
+ Converted From:
+ Weights: https://download.pytorch.org/models/densenet121-a639ec97.pth
+ Code: https://github.com/pytorch/vision/blob/main/torchvision/models/densenet.py
+ - Name: densenet169_3rdparty_in1k
+ Metadata:
+ FLOPs: 3416860160
+ Parameters: 14149480
+ In Collection: DenseNet
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 76.08
+ Top 5 Accuracy: 93.11
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/densenet/densenet169_4xb256_in1k_20220426-a2889902.pth
+ Config: configs/densenet/densenet169_4xb256_in1k.py
+ Converted From:
+ Weights: https://download.pytorch.org/models/densenet169-b2777c0a.pth
+ Code: https://github.com/pytorch/vision/blob/main/torchvision/models/densenet.py
+ - Name: densenet201_3rdparty_in1k
+ Metadata:
+ FLOPs: 4365236736
+ Parameters: 20013928
+ In Collection: DenseNet
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 77.32
+ Top 5 Accuracy: 93.64
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/densenet/densenet201_4xb256_in1k_20220426-05cae4ef.pth
+ Config: configs/densenet/densenet201_4xb256_in1k.py
+ Converted From:
+ Weights: https://download.pytorch.org/models/densenet201-c1103571.pth
+ Code: https://github.com/pytorch/vision/blob/main/torchvision/models/densenet.py
+ - Name: densenet161_3rdparty_in1k
+ Metadata:
+ FLOPs: 7816363968
+ Parameters: 28681000
+ In Collection: DenseNet
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 77.61
+ Top 5 Accuracy: 93.83
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/densenet/densenet161_4xb256_in1k_20220426-ee6a80a9.pth
+ Config: configs/densenet/densenet161_4xb256_in1k.py
+ Converted From:
+ Weights: https://download.pytorch.org/models/densenet161-8d451a50.pth
+ Code: https://github.com/pytorch/vision/blob/main/torchvision/models/densenet.py
diff --git a/configs/dinov2/README.md b/configs/dinov2/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..aa79d6b43c677f96236a52630b39ca9a6e381e5d
--- /dev/null
+++ b/configs/dinov2/README.md
@@ -0,0 +1,58 @@
+# DINOv2
+
+> [DINOv2: Learning Robust Visual Features without Supervision](https://arxiv.org/abs/2304.07193)
+
+
+
+## Abstract
+
+The recent breakthroughs in natural language processing for model pretraining on large quantities of data have opened the way for similar foundation models in computer vision. These models could greatly simplify the use of images in any system by producing all-purpose visual features, i.e., features that work across image distributions and tasks without finetuning. This work shows that existing pretraining methods, especially self-supervised methods, can produce such features if trained on enough curated data from diverse sources. We revisit existing approaches and combine different techniques to scale our pretraining in terms of data and model size. Most of the technical contributions aim at accelerating and stabilizing the training at scale. In terms of data, we propose an automatic pipeline to build a dedicated, diverse, and curated image dataset instead of uncurated data, as typically done in the self-supervised literature. In terms of models, we train a ViT model (Dosovitskiy et al., 2020) with 1B parameters and distill it into a series of smaller models that surpass the best available all-purpose features, OpenCLIP (Ilharco et al., 2021), on most of the benchmarks at image and pixel levels.
+
+
+
+
+
+## How to use it?
+
+
+
+**Use the model**
+
+```python
+import torch
+from mmpretrain import get_model
+
+model = get_model('vit-small-p14_dinov2-pre_3rdparty', pretrained=True)
+inputs = torch.rand(1, 3, 224, 224)
+out = model(inputs)
+print(type(out))
+# To extract features.
+feats = model.extract_feat(inputs)
+print(type(feats))
+```
+
+
+
+## Models and results
+
+### Pretrained models
+
+| Model | Params (M) | Flops (G) | Config | Download |
+| :------------------------------------ | :--------: | :-------: | :--------------------------------------------: | :------------------------------------------------------------------------------------------------: |
+| `vit-small-p14_dinov2-pre_3rdparty`\* | 22.06 | 46.76 | [config](vit-small-p14_dinov2-pre_headless.py) | [model](https://download.openmmlab.com/mmpretrain/v1.0/dinov2/vit-small-p14_dinov2-pre_3rdparty_20230426-5641ca5a.pth) |
+| `vit-base-p14_dinov2-pre_3rdparty`\* | 86.58 | 152.00 | [config](vit-base-p14_dinov2-pre_headless.py) | [model](https://download.openmmlab.com/mmpretrain/v1.0/dinov2/vit-base-p14_dinov2-pre_3rdparty_20230426-ba246503.pth) |
+| `vit-large-p14_dinov2-pre_3rdparty`\* | 304.00 | 507.00 | [config](vit-large-p14_dinov2-pre_headless.py) | [model](https://download.openmmlab.com/mmpretrain/v1.0/dinov2/vit-large-p14_dinov2-pre_3rdparty_20230426-f3302d9e.pth) |
+| `vit-giant-p14_dinov2-pre_3rdparty`\* | 1136.00 | 1784.00 | [config](vit-giant-p14_dinov2-pre_headless.py) | [model](https://download.openmmlab.com/mmpretrain/v1.0/dinov2/vit-giant-p14_dinov2-pre_3rdparty_20230426-2934a630.pth) |
+
+*Models with * are converted from the [official repo](https://github.com/facebookresearch/dinov2). The config files of these models are only for inference. We haven't reproduced the training results.*
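+
+Since these checkpoints are headless, a common pattern is to reuse them only to initialize a backbone elsewhere. Below is a hypothetical config sketch, not an official recipe: it reuses the ViT-base checkpoint URL from the table above and assumes the standard MMEngine `Pretrained` initializer with `prefix='backbone.'` to pick the backbone weights out of the full checkpoint.
+
+```python
+model = dict(
+    type='ImageClassifier',
+    backbone=dict(
+        type='VisionTransformer',
+        arch='base',
+        img_size=518,
+        patch_size=14,
+        layer_scale_init_value=1e-5,
+        # Load only the `backbone.*` weights from the converted checkpoint.
+        init_cfg=dict(
+            type='Pretrained',
+            checkpoint='https://download.openmmlab.com/mmpretrain/v1.0/dinov2/vit-base-p14_dinov2-pre_3rdparty_20230426-ba246503.pth',
+            prefix='backbone.')),
+    neck=None,
+    head=None)
+```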
+
+## Citation
+
+```bibtex
+@misc{oquab2023dinov2,
+ title={DINOv2: Learning Robust Visual Features without Supervision},
+ author={Oquab, Maxime and Darcet, Timothée and Moutakanni, Theo and Vo, Huy V. and Szafraniec, Marc and Khalidov, Vasil and Fernandez, Pierre and Haziza, Daniel and Massa, Francisco and El-Nouby, Alaaeldin and Howes, Russell and Huang, Po-Yao and Xu, Hu and Sharma, Vasu and Li, Shang-Wen and Galuba, Wojciech and Rabbat, Mike and Assran, Mido and Ballas, Nicolas and Synnaeve, Gabriel and Misra, Ishan and Jegou, Herve and Mairal, Julien and Labatut, Patrick and Joulin, Armand and Bojanowski, Piotr},
+ journal={arXiv:2304.07193},
+ year={2023}
+}
+```
diff --git a/configs/dinov2/metafile.yml b/configs/dinov2/metafile.yml
new file mode 100644
index 0000000000000000000000000000000000000000..48f205a24abf006019fa00041bfc8cb5a138aa55
--- /dev/null
+++ b/configs/dinov2/metafile.yml
@@ -0,0 +1,73 @@
+Collections:
+ - Name: DINOv2
+ Metadata:
+ Architecture:
+ - Dropout
+ - GELU
+ - Layer Normalization
+ - Multi-Head Attention
+ - Scaled Dot-Product Attention
+ Paper:
+ Title: 'DINOv2: Learning Robust Visual Features without Supervision'
+ URL: https://arxiv.org/abs/2304.07193
+ README: configs/dinov2/README.md
+ Code:
+ URL: null
+ Version: null
+
+Models:
+ - Name: vit-small-p14_dinov2-pre_3rdparty
+ Metadata:
+ FLOPs: 46762000000
+ Parameters: 22056000
+ Training Data:
+ - LVD-142M
+ In Collection: DINOv2
+ Results: null
+ Weights: https://download.openmmlab.com/mmpretrain/v1.0/dinov2/vit-small-p14_dinov2-pre_3rdparty_20230426-5641ca5a.pth
+ Config: configs/dinov2/vit-small-p14_dinov2-pre_headless.py
+ Converted From:
+ Weights: https://dl.fbaipublicfiles.com/dinov2/dinov2_vits14/dinov2_vits14_pretrain.pth
+ Code: https://github.com/facebookresearch/dinov2
+
+ - Name: vit-base-p14_dinov2-pre_3rdparty
+ Metadata:
+ FLOPs: 152000000000
+ Parameters: 86580000
+ Training Data:
+ - LVD-142M
+ In Collection: DINOv2
+ Results: null
+ Weights: https://download.openmmlab.com/mmpretrain/v1.0/dinov2/vit-base-p14_dinov2-pre_3rdparty_20230426-ba246503.pth
+ Config: configs/dinov2/vit-base-p14_dinov2-pre_headless.py
+ Converted From:
+ Weights: https://dl.fbaipublicfiles.com/dinov2/dinov2_vitb14/dinov2_vitb14_pretrain.pth
+ Code: https://github.com/facebookresearch/dinov2
+
+ - Name: vit-large-p14_dinov2-pre_3rdparty
+ Metadata:
+ FLOPs: 507000000000
+ Parameters: 304000000
+ Training Data:
+ - LVD-142M
+ In Collection: DINOv2
+ Results: null
+ Weights: https://download.openmmlab.com/mmpretrain/v1.0/dinov2/vit-large-p14_dinov2-pre_3rdparty_20230426-f3302d9e.pth
+ Config: configs/dinov2/vit-large-p14_dinov2-pre_headless.py
+ Converted From:
+ Weights: https://dl.fbaipublicfiles.com/dinov2/dinov2_vitl14/dinov2_vitl14_pretrain.pth
+ Code: https://github.com/facebookresearch/dinov2
+
+ - Name: vit-giant-p14_dinov2-pre_3rdparty
+ Metadata:
+ FLOPs: 1784000000000
+ Parameters: 1136000000
+ Training Data:
+ - LVD-142M
+ In Collection: DINOv2
+ Results: null
+ Weights: https://download.openmmlab.com/mmpretrain/v1.0/dinov2/vit-giant-p14_dinov2-pre_3rdparty_20230426-2934a630.pth
+ Config: configs/dinov2/vit-giant-p14_dinov2-pre_headless.py
+ Converted From:
+ Weights: https://dl.fbaipublicfiles.com/dinov2/dinov2_vitg14/dinov2_vitg14_pretrain.pth
+ Code: https://github.com/facebookresearch/dinov2
diff --git a/configs/dinov2/vit-base-p14_dinov2-pre_headless.py b/configs/dinov2/vit-base-p14_dinov2-pre_headless.py
new file mode 100644
index 0000000000000000000000000000000000000000..524dfe30bf47db1614d203097ffcfeeec5f68c1a
--- /dev/null
+++ b/configs/dinov2/vit-base-p14_dinov2-pre_headless.py
@@ -0,0 +1,20 @@
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='VisionTransformer',
+ arch='base',
+ img_size=518,
+ patch_size=14,
+ layer_scale_init_value=1e-5,
+ ),
+ neck=None,
+ head=None)
+
+data_preprocessor = dict(
+ # RGB format normalization parameters
+ mean=[123.675, 116.28, 103.53],
+ std=[58.395, 57.12, 57.375],
+ # convert image from BGR to RGB
+ to_rgb=True,
+)
diff --git a/configs/dinov2/vit-giant-p14_dinov2-pre_headless.py b/configs/dinov2/vit-giant-p14_dinov2-pre_headless.py
new file mode 100644
index 0000000000000000000000000000000000000000..a127359e5c44b6fa99482c3720cc1555432af699
--- /dev/null
+++ b/configs/dinov2/vit-giant-p14_dinov2-pre_headless.py
@@ -0,0 +1,21 @@
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='VisionTransformer',
+ arch='dinov2-giant',
+ img_size=518,
+ patch_size=14,
+ layer_scale_init_value=1e-5,
+ layer_cfgs=dict(ffn_type='swiglu_fused'),
+ ),
+ neck=None,
+ head=None)
+
+data_preprocessor = dict(
+ # RGB format normalization parameters
+ mean=[123.675, 116.28, 103.53],
+ std=[58.395, 57.12, 57.375],
+ # convert image from BGR to RGB
+ to_rgb=True,
+)
diff --git a/configs/dinov2/vit-large-p14_dinov2-pre_headless.py b/configs/dinov2/vit-large-p14_dinov2-pre_headless.py
new file mode 100644
index 0000000000000000000000000000000000000000..4ec7bc68455520bef8986a8d563e5c732f3bf994
--- /dev/null
+++ b/configs/dinov2/vit-large-p14_dinov2-pre_headless.py
@@ -0,0 +1,20 @@
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='VisionTransformer',
+ arch='large',
+ img_size=518,
+ patch_size=14,
+ layer_scale_init_value=1e-5,
+ ),
+ neck=None,
+ head=None)
+
+data_preprocessor = dict(
+ # RGB format normalization parameters
+ mean=[123.675, 116.28, 103.53],
+ std=[58.395, 57.12, 57.375],
+ # convert image from BGR to RGB
+ to_rgb=True,
+)
diff --git a/configs/dinov2/vit-small-p14_dinov2-pre_headless.py b/configs/dinov2/vit-small-p14_dinov2-pre_headless.py
new file mode 100644
index 0000000000000000000000000000000000000000..198c5e51ab29be9202ac053c082366ec818e3982
--- /dev/null
+++ b/configs/dinov2/vit-small-p14_dinov2-pre_headless.py
@@ -0,0 +1,20 @@
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='VisionTransformer',
+ arch='dinov2-small',
+ img_size=518,
+ patch_size=14,
+ layer_scale_init_value=1e-5,
+ ),
+ neck=None,
+ head=None)
+
+data_preprocessor = dict(
+ # RGB format normalization parameters
+ mean=[123.675, 116.28, 103.53],
+ std=[58.395, 57.12, 57.375],
+ # convert image from BGR to RGB
+ to_rgb=True,
+)
diff --git a/configs/edgenext/README.md b/configs/edgenext/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..1c9686f7d96183feb115f2bb6860688e48440ed8
--- /dev/null
+++ b/configs/edgenext/README.md
@@ -0,0 +1,80 @@
+# EdgeNeXt
+
+> [EdgeNeXt: Efficiently Amalgamated CNN-Transformer Architecture for Mobile Vision Applications](https://arxiv.org/abs/2206.10589)
+
+
+
+## Abstract
+
+In the pursuit of achieving ever-increasing accuracy, large and complex neural networks are usually developed. Such models demand high computational resources and therefore cannot be deployed on edge devices. It is of great interest to build resource-efficient general purpose networks due to their usefulness in several application areas. In this work, we strive to effectively combine the strengths of both CNN and Transformer models and propose a new efficient hybrid architecture EdgeNeXt. Specifically in EdgeNeXt, we introduce split depth-wise transpose attention (SDTA) encoder that splits input tensors into multiple channel groups and utilizes depth-wise convolution along with self-attention across channel dimensions to implicitly increase the receptive field and encode multi-scale features. Our extensive experiments on classification, detection and segmentation tasks, reveal the merits of the proposed approach, outperforming state-of-the-art methods with comparatively lower compute requirements. Our EdgeNeXt model with 1.3M parameters achieves 71.2% top-1 accuracy on ImageNet-1K, outperforming MobileViT with an absolute gain of 2.2% with 28% reduction in FLOPs. Further, our EdgeNeXt model with 5.6M parameters achieves 79.4% top-1 accuracy on ImageNet-1K.
+
+
+
+
+
+## How to use it?
+
+
+
+**Predict image**
+
+```python
+from mmpretrain import inference_model
+
+predict = inference_model('edgenext-xxsmall_3rdparty_in1k', 'demo/bird.JPEG')
+print(predict['pred_class'])
+print(predict['pred_score'])
+```
+
+**Use the model**
+
+```python
+import torch
+from mmpretrain import get_model
+
+model = get_model('edgenext-xxsmall_3rdparty_in1k', pretrained=True)
+inputs = torch.rand(1, 3, 224, 224)
+out = model(inputs)
+print(type(out))
+# To extract features.
+feats = model.extract_feat(inputs)
+print(type(feats))
+```
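+
+Continuing from the snippet above, `out` is the raw output of the classification head. A small sketch, assuming it is a logits tensor of shape `(1, 1000)` as is usual for these ImageNet checkpoints, to turn it into a readable prediction:
+
+```python
+# Convert raw logits to probabilities and take the top-1 class index.
+probs = torch.softmax(out, dim=1)
+score, label = probs.max(dim=1)
+print(label.item(), score.item())
+```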
+
+**Test Command**
+
+Prepare your dataset according to the [docs](https://mmpretrain.readthedocs.io/en/latest/user_guides/dataset_prepare.html#prepare-dataset).
+
+Test:
+
+```shell
+python tools/test.py configs/edgenext/edgenext-xxsmall_8xb256_in1k.py https://download.openmmlab.com/mmclassification/v0/edgenext/edgenext-xxsmall_3rdparty_in1k_20220801-7ca8a81d.pth
+```
+
+
+
+## Models and results
+
+### Image Classification on ImageNet-1k
+
+| Model | Pretrain | Params (M) | Flops (G) | Top-1 (%) | Top-5 (%) | Config | Download |
+| :----------------------------------- | :----------: | :--------: | :-------: | :-------: | :-------: | :-----------------------------------------: | :----------------------------------------------------------------------: |
+| `edgenext-xxsmall_3rdparty_in1k`\* | From scratch | 1.33 | 0.26 | 71.20 | 89.91 | [config](edgenext-xxsmall_8xb256_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/edgenext/edgenext-xxsmall_3rdparty_in1k_20220801-7ca8a81d.pth) |
+| `edgenext-xsmall_3rdparty_in1k`\* | From scratch | 2.34 | 0.53 | 74.86 | 92.31 | [config](edgenext-xsmall_8xb256_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/edgenext/edgenext-xsmall_3rdparty_in1k_20220801-974f9fe7.pth) |
+| `edgenext-small_3rdparty_in1k`\* | From scratch | 5.59 | 1.25 | 79.41 | 94.53 | [config](edgenext-small_8xb256_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/edgenext/edgenext-small_3rdparty_in1k_20220801-d00db5f8.pth) |
+| `edgenext-small-usi_3rdparty_in1k`\* | From scratch | 5.59 | 1.25 | 81.06 | 95.34 | [config](edgenext-small_8xb256-usi_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/edgenext/edgenext-small_3rdparty-usi_in1k_20220801-ae6d8dd3.pth) |
+| `edgenext-base_3rdparty_in1k`\* | From scratch | 18.51 | 3.81 | 82.48 | 96.20 | [config](edgenext-base_8xb256_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/edgenext/edgenext-base_3rdparty_in1k_20220801-9ade408b.pth) |
+| `edgenext-base_3rdparty-usi_in1k`\* | From scratch | 18.51 | 3.81 | 83.67 | 96.70 | [config](edgenext-base_8xb256-usi_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/edgenext/edgenext-base_3rdparty-usi_in1k_20220801-909e8939.pth) |
+
+*Models with * are converted from the [official repo](https://github.com/mmaaz60/EdgeNeXt). The config files of these models are only for inference. We haven't reproduced the training results.*
+
+## Citation
+
+```bibtex
+@article{Maaz2022EdgeNeXt,
+ title={EdgeNeXt: Efficiently Amalgamated CNN-Transformer Architecture for Mobile Vision Applications},
+ author={Muhammad Maaz and Abdelrahman Shaker and Hisham Cholakkal and Salman Khan and Syed Waqas Zamir and Rao Muhammad Anwer and Fahad Shahbaz Khan},
+ journal={arXiv:2206.10589},
+ year={2022}
+}
+```
diff --git a/configs/edgenext/edgenext-base_8xb256-usi_in1k.py b/configs/edgenext/edgenext-base_8xb256-usi_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..13949deaed9b09f7473fca60d4bab2012ce00c48
--- /dev/null
+++ b/configs/edgenext/edgenext-base_8xb256-usi_in1k.py
@@ -0,0 +1,19 @@
+_base_ = ['./edgenext-base_8xb256_in1k.py']
+
+# dataset setting
+
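+# Resize the short edge to 269 (about 256 / 0.95, i.e. a 0.95 center-crop
+# ratio at test resolution 256), then center-crop to 256; this presumably
+# mirrors the evaluation setting of the original USI checkpoint.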
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='ResizeEdge',
+ scale=269,
+ edge='short',
+ backend='pillow',
+ interpolation='bicubic'),
+ dict(type='CenterCrop', crop_size=256),
+ dict(type='PackInputs')
+]
+
+val_dataloader = dict(dataset=dict(pipeline=test_pipeline))
+
+test_dataloader = val_dataloader
diff --git a/configs/edgenext/edgenext-base_8xb256_in1k.py b/configs/edgenext/edgenext-base_8xb256_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..5d0a75c62fe0c771e65541937ca32b9b7ca3e9e0
--- /dev/null
+++ b/configs/edgenext/edgenext-base_8xb256_in1k.py
@@ -0,0 +1,20 @@
+_base_ = [
+ '../_base_/models/edgenext/edgenext-base.py',
+ '../_base_/datasets/imagenet_bs64_edgenext_256.py',
+ '../_base_/schedules/imagenet_bs1024_adamw_swin.py',
+ '../_base_/default_runtime.py',
+]
+
+# schedule setting
+optim_wrapper = dict(
+ optimizer=dict(lr=6e-3),
+ clip_grad=dict(max_norm=5.0),
+)
+
+# runtime setting
+custom_hooks = [dict(type='EMAHook', momentum=4e-5, priority='ABOVE_NORMAL')]
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR
+# based on the actual training batch size.
+# base_batch_size = (32 GPUs) x (128 samples per GPU)
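+# For this config's intended setup (8 GPUs x 256 samples per GPU = 2048 in
+# total), enabling `--auto-scale-lr` in the train script would scale the LR
+# linearly by 2048 / 4096 = 0.5.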
+auto_scale_lr = dict(base_batch_size=4096)
diff --git a/configs/edgenext/edgenext-small_8xb256-usi_in1k.py b/configs/edgenext/edgenext-small_8xb256-usi_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..d6bc904be7f7e82eb3b9769260dd3559ee33e45f
--- /dev/null
+++ b/configs/edgenext/edgenext-small_8xb256-usi_in1k.py
@@ -0,0 +1,19 @@
+_base_ = ['./edgenext-small_8xb256_in1k.py']
+
+# dataset setting
+
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='ResizeEdge',
+ scale=269,
+ edge='short',
+ backend='pillow',
+ interpolation='bicubic'),
+ dict(type='CenterCrop', crop_size=256),
+ dict(type='PackInputs')
+]
+
+val_dataloader = dict(dataset=dict(pipeline=test_pipeline))
+
+test_dataloader = val_dataloader
diff --git a/configs/edgenext/edgenext-small_8xb256_in1k.py b/configs/edgenext/edgenext-small_8xb256_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..f1d99bdc9f6958037306c98ba863ffb8743fa347
--- /dev/null
+++ b/configs/edgenext/edgenext-small_8xb256_in1k.py
@@ -0,0 +1,20 @@
+_base_ = [
+ '../_base_/models/edgenext/edgenext-small.py',
+ '../_base_/datasets/imagenet_bs64_edgenext_256.py',
+ '../_base_/schedules/imagenet_bs1024_adamw_swin.py',
+ '../_base_/default_runtime.py',
+]
+
+# schedule setting
+optim_wrapper = dict(
+ optimizer=dict(lr=6e-3),
+ clip_grad=dict(max_norm=5.0),
+)
+
+# runtime setting
+custom_hooks = [dict(type='EMAHook', momentum=4e-5, priority='ABOVE_NORMAL')]
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR
+# based on the actual training batch size.
+# base_batch_size = (32 GPUs) x (128 samples per GPU)
+auto_scale_lr = dict(base_batch_size=4096)
diff --git a/configs/edgenext/edgenext-xsmall_8xb256_in1k.py b/configs/edgenext/edgenext-xsmall_8xb256_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..9d2326fc9deda56d1366a4ec9cafff4e4740c24c
--- /dev/null
+++ b/configs/edgenext/edgenext-xsmall_8xb256_in1k.py
@@ -0,0 +1,20 @@
+_base_ = [
+ '../_base_/models/edgenext/edgenext-xsmall.py',
+ '../_base_/datasets/imagenet_bs64_edgenext_256.py',
+ '../_base_/schedules/imagenet_bs1024_adamw_swin.py',
+ '../_base_/default_runtime.py',
+]
+
+# schedule setting
+optim_wrapper = dict(
+ optimizer=dict(lr=6e-3),
+ clip_grad=dict(max_norm=5.0),
+)
+
+# runtime setting
+custom_hooks = [dict(type='EMAHook', momentum=4e-5, priority='ABOVE_NORMAL')]
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR
+# based on the actual training batch size.
+# base_batch_size = (32 GPUs) x (128 samples per GPU)
+auto_scale_lr = dict(base_batch_size=4096)
diff --git a/configs/edgenext/edgenext-xxsmall_8xb256_in1k.py b/configs/edgenext/edgenext-xxsmall_8xb256_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..507c3cb598fab10416d621e0e4cf4f78114a7918
--- /dev/null
+++ b/configs/edgenext/edgenext-xxsmall_8xb256_in1k.py
@@ -0,0 +1,20 @@
+_base_ = [
+ '../_base_/models/edgenext/edgenext-xxsmall.py',
+ '../_base_/datasets/imagenet_bs64_edgenext_256.py',
+ '../_base_/schedules/imagenet_bs1024_adamw_swin.py',
+ '../_base_/default_runtime.py',
+]
+
+# schedule setting
+optim_wrapper = dict(
+ optimizer=dict(lr=6e-3),
+ clip_grad=dict(max_norm=5.0),
+)
+
+# runtime setting
+custom_hooks = [dict(type='EMAHook', momentum=4e-5, priority='ABOVE_NORMAL')]
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR
+# based on the actual training batch size.
+# base_batch_size = (32 GPUs) x (128 samples per GPU)
+auto_scale_lr = dict(base_batch_size=4096)
diff --git a/configs/edgenext/metafile.yml b/configs/edgenext/metafile.yml
new file mode 100644
index 0000000000000000000000000000000000000000..e69ac17405ea5081c515e8a48ff550e09675e867
--- /dev/null
+++ b/configs/edgenext/metafile.yml
@@ -0,0 +1,118 @@
+Collections:
+ - Name: EdgeNeXt
+ Metadata:
+ Training Data: ImageNet-1k
+ Architecture:
+ - SDTA
+ - 1x1 Convolution
+ - Channel Self-attention
+ Paper:
+ URL: https://arxiv.org/abs/2206.10589
+ Title: 'EdgeNeXt: Efficiently Amalgamated CNN-Transformer Architecture for Mobile Vision Applications'
+ README: configs/edgenext/README.md
+ Code:
+ Version: v1.0.0rc1
+ URL: https://github.com/open-mmlab/mmpretrain/blob/v0.23.2/mmcls/models/backbones/edgenext.py
+
+Models:
+ - Name: edgenext-xxsmall_3rdparty_in1k
+ Metadata:
+ FLOPs: 255640144
+ Parameters: 1327216
+ In Collection: EdgeNeXt
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 71.20
+ Top 5 Accuracy: 89.91
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/edgenext/edgenext-xxsmall_3rdparty_in1k_20220801-7ca8a81d.pth
+ Config: configs/edgenext/edgenext-xxsmall_8xb256_in1k.py
+ Converted From:
+ Weights: https://github.com/mmaaz60/EdgeNeXt/releases/download/v1.0/edgenext_xxsmall.pth
+ Code: https://github.com/mmaaz60/EdgeNeXt
+ - Name: edgenext-xsmall_3rdparty_in1k
+ Metadata:
+ Training Data: ImageNet-1k
+ FLOPs: 529970560
+ Parameters: 2336804
+ In Collection: EdgeNeXt
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 74.86
+ Top 5 Accuracy: 92.31
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/edgenext/edgenext-xsmall_3rdparty_in1k_20220801-974f9fe7.pth
+ Config: configs/edgenext/edgenext-xsmall_8xb256_in1k.py
+ Converted From:
+ Weights: https://github.com/mmaaz60/EdgeNeXt/releases/download/v1.0/edgenext_xsmall.pth
+ Code: https://github.com/mmaaz60/EdgeNeXt
+ - Name: edgenext-small_3rdparty_in1k
+ Metadata:
+ Training Data: ImageNet-1k
+ FLOPs: 1249788000
+ Parameters: 5586832
+ In Collection: EdgeNeXt
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 79.41
+ Top 5 Accuracy: 94.53
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/edgenext/edgenext-small_3rdparty_in1k_20220801-d00db5f8.pth
+ Config: configs/edgenext/edgenext-small_8xb256_in1k.py
+ Converted From:
+ Weights: https://github.com/mmaaz60/EdgeNeXt/releases/download/v1.0/edgenext_small.pth
+ Code: https://github.com/mmaaz60/EdgeNeXt
+ - Name: edgenext-small-usi_3rdparty_in1k
+ Metadata:
+ Training Data: ImageNet-1k
+ FLOPs: 1249788000
+ Parameters: 5586832
+ In Collection: EdgeNeXt
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 81.06
+ Top 5 Accuracy: 95.34
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/edgenext/edgenext-small_3rdparty-usi_in1k_20220801-ae6d8dd3.pth
+ Config: configs/edgenext/edgenext-small_8xb256-usi_in1k.py
+ Converted From:
+ Weights: https://github.com/mmaaz60/EdgeNeXt/releases/download/v1.1/edgenext_small_usi.pth
+ Code: https://github.com/mmaaz60/EdgeNeXt
+ - Name: edgenext-base_3rdparty_in1k
+ Metadata:
+ Training Data: ImageNet-1k
+ FLOPs: 3814395280
+ Parameters: 18511292
+ In Collection: EdgeNeXt
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 82.48
+ Top 5 Accuracy: 96.2
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/edgenext/edgenext-base_3rdparty_in1k_20220801-9ade408b.pth
+ Config: configs/edgenext/edgenext-base_8xb256_in1k.py
+ Converted From:
+ Weights: https://github.com/mmaaz60/EdgeNeXt/releases/download/v1.2/edgenext_base.pth
+ Code: https://github.com/mmaaz60/EdgeNeXt
+ - Name: edgenext-base_3rdparty-usi_in1k
+ Metadata:
+ Training Data: ImageNet-1k
+ FLOPs: 3814395280
+ Parameters: 18511292
+ In Collection: EdgeNeXt
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 83.67
+ Top 5 Accuracy: 96.7
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/edgenext/edgenext-base_3rdparty-usi_in1k_20220801-909e8939.pth
+ Config: configs/edgenext/edgenext-base_8xb256-usi_in1k.py
+ Converted From:
+ Weights: https://github.com/mmaaz60/EdgeNeXt/releases/download/v1.2/edgenext_base_usi.pth
+ Code: https://github.com/mmaaz60/EdgeNeXt
diff --git a/configs/efficientformer/README.md b/configs/efficientformer/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..537777efc0da6cba6aa198ab204945a1c3712688
--- /dev/null
+++ b/configs/efficientformer/README.md
@@ -0,0 +1,88 @@
+# EfficientFormer
+
+> [EfficientFormer: Vision Transformers at MobileNet Speed](https://arxiv.org/abs/2206.01191)
+
+
+
+## Abstract
+
+Vision Transformers (ViT) have shown rapid progress in computer vision tasks, achieving promising results on various benchmarks. However, due to the massive number of parameters and model design, e.g., attention mechanism, ViT-based models are generally times slower than lightweight convolutional networks. Therefore, the deployment of ViT for real-time applications is particularly challenging, especially on resource-constrained hardware such as mobile devices. Recent efforts try to reduce the computation complexity of ViT through network architecture search or hybrid design with MobileNet block, yet the inference speed is still unsatisfactory. This leads to an important question: can transformers run as fast as MobileNet while obtaining high performance? To answer this, we first revisit the network architecture and operators used in ViT-based models and identify inefficient designs. Then we introduce a dimension-consistent pure transformer (without MobileNet blocks) as a design paradigm. Finally, we perform latency-driven slimming to get a series of final models dubbed EfficientFormer. Extensive experiments show the superiority of EfficientFormer in performance and speed on mobile devices. Our fastest model, EfficientFormer-L1, achieves 79.2% top-1 accuracy on ImageNet-1K with only 1.6 ms inference latency on iPhone 12 (compiled with CoreML), which runs as fast as MobileNetV2×1.4 (1.6 ms, 74.7% top-1), and our largest model, EfficientFormer-L7, obtains 83.3% accuracy with only 7.0 ms latency. Our work proves that properly designed transformers can reach extremely low latency on mobile devices while maintaining high performance.
+
+
+
+
+
+## How to use it?
+
+
+
+**Predict image**
+
+```python
+from mmpretrain import inference_model
+
+predict = inference_model('efficientformer-l1_3rdparty_8xb128_in1k', 'demo/bird.JPEG')
+print(predict['pred_class'])
+print(predict['pred_score'])
+```
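+
+`inference_model` is convenient for a one-off prediction, but it builds a new model on every call. For repeated or batched inference, a sketch using the inferencer class is shown below; it assumes the `ImageClassificationInferencer` helper exported by mmpretrain and uses example image paths, so adjust both to your setup:
+
+```python
+from mmpretrain import ImageClassificationInferencer
+
+# Build the model once, then reuse it for many images.
+inferencer = ImageClassificationInferencer('efficientformer-l1_3rdparty_8xb128_in1k', pretrained=True)
+results = inferencer(['demo/bird.JPEG', 'demo/demo.JPEG'])
+print([r['pred_class'] for r in results])
+```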
+
+**Use the model**
+
+```python
+import torch
+from mmpretrain import get_model
+
+model = get_model('efficientformer-l1_3rdparty_8xb128_in1k', pretrained=True)
+inputs = torch.rand(1, 3, 224, 224)
+out = model(inputs)
+print(type(out))
+# To extract features.
+feats = model.extract_feat(inputs)
+print(type(feats))
+```
+
+**Test Command**
+
+Prepare your dataset according to the [docs](https://mmpretrain.readthedocs.io/en/latest/user_guides/dataset_prepare.html#prepare-dataset).
+
+Test:
+
+```shell
+python tools/test.py configs/efficientformer/efficientformer-l1_8xb128_in1k.py https://download.openmmlab.com/mmclassification/v0/efficientformer/efficientformer-l1_3rdparty_in1k_20220915-cc3e1ac6.pth
+```
+
+
+
+## Models and results
+
+### Image Classification on ImageNet-1k
+
+| Model | Pretrain | Params (M) | Flops (G) | Top-1 (%) | Top-5 (%) | Config | Download |
+| :------------------------------------------ | :----------: | :--------: | :-------: | :-------: | :-------: | :-----------------------------------------: | :---------------------------------------------------------------: |
+| `efficientformer-l1_3rdparty_8xb128_in1k`\* | From scratch | 12.28 | 1.30 | 80.46 | 94.99 | [config](efficientformer-l1_8xb128_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/efficientformer/efficientformer-l1_3rdparty_in1k_20220915-cc3e1ac6.pth) |
+| `efficientformer-l3_3rdparty_8xb128_in1k`\* | From scratch | 31.41 | 3.74 | 82.45 | 96.18 | [config](efficientformer-l3_8xb128_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/efficientformer/efficientformer-l3_3rdparty_in1k_20220915-466793d6.pth) |
+| `efficientformer-l7_3rdparty_8xb128_in1k`\* | From scratch | 82.23 | 10.16 | 83.40 | 96.60 | [config](efficientformer-l7_8xb128_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/efficientformer/efficientformer-l7_3rdparty_in1k_20220915-185e30af.pth) |
+
+*Models with * are converted from the [official repo](https://github.com/snap-research/EfficientFormer). The config files of these models are only for inference. We haven't reproduced the training results.*
+
+## Citation
+
+```bibtex
+@misc{https://doi.org/10.48550/arxiv.2206.01191,
+  doi = {10.48550/ARXIV.2206.01191},
+  url = {https://arxiv.org/abs/2206.01191},
+  author = {Li, Yanyu and Yuan, Geng and Wen, Yang and Hu, Eric and Evangelidis, Georgios and Tulyakov, Sergey and Wang, Yanzhi and Ren, Jian},
+  keywords = {Computer Vision and Pattern Recognition (cs.CV), FOS: Computer and information sciences},
+  title = {EfficientFormer: Vision Transformers at MobileNet Speed},
+  publisher = {arXiv},
+  year = {2022},
+  copyright = {Creative Commons Attribution 4.0 International}
+}
+```
diff --git a/configs/efficientformer/efficientformer-l1_8xb128_in1k.py b/configs/efficientformer/efficientformer-l1_8xb128_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..7f55dc653eccad42dcf95d60f9aab86460ca9117
--- /dev/null
+++ b/configs/efficientformer/efficientformer-l1_8xb128_in1k.py
@@ -0,0 +1,6 @@
+_base_ = [
+ '../_base_/models/efficientformer-l1.py',
+ '../_base_/datasets/imagenet_bs128_poolformer_small_224.py',
+ '../_base_/schedules/imagenet_bs1024_adamw_swin.py',
+ '../_base_/default_runtime.py',
+]
diff --git a/configs/efficientformer/efficientformer-l3_8xb128_in1k.py b/configs/efficientformer/efficientformer-l3_8xb128_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..d8be5efae1ad93f175c25eabc6361a20c1ece76f
--- /dev/null
+++ b/configs/efficientformer/efficientformer-l3_8xb128_in1k.py
@@ -0,0 +1,3 @@
+_base_ = './efficientformer-l1_8xb128_in1k.py'
+
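+# The head's in_channels must match the output channel width of the chosen
+# arch (512 for 'l3').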
+model = dict(backbone=dict(arch='l3'), head=dict(in_channels=512))
diff --git a/configs/efficientformer/efficientformer-l7_8xb128_in1k.py b/configs/efficientformer/efficientformer-l7_8xb128_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..c2252652efe55840880ad64cde121a51614f4e84
--- /dev/null
+++ b/configs/efficientformer/efficientformer-l7_8xb128_in1k.py
@@ -0,0 +1,3 @@
+_base_ = './efficientformer-l1_8xb128_in1k.py'
+
+model = dict(backbone=dict(arch='l7'), head=dict(in_channels=768))
diff --git a/configs/efficientformer/metafile.yml b/configs/efficientformer/metafile.yml
new file mode 100644
index 0000000000000000000000000000000000000000..5c70f07ec52f956e0644d4e25d4162ed009ac72a
--- /dev/null
+++ b/configs/efficientformer/metafile.yml
@@ -0,0 +1,67 @@
+Collections:
+ - Name: EfficientFormer
+ Metadata:
+ Training Data: ImageNet-1k
+ Architecture:
+ - Pooling
+ - 1x1 Convolution
+ - LayerScale
+ - MetaFormer
+ Paper:
+ URL: https://arxiv.org/abs/2206.01191
+ Title: "EfficientFormer: Vision Transformers at MobileNet Speed"
+ README: configs/efficientformer/README.md
+ Code:
+ Version: v1.0.0rc1
+ URL: https://github.com/open-mmlab/mmpretrain/blob/v1.0.0rc1/configs/efficientformer/metafile.yml
+
+Models:
+ - Name: efficientformer-l1_3rdparty_8xb128_in1k
+ Metadata:
+ FLOPs: 1304601088 # 1.3G
+ Parameters: 12278696 # 12M
+ In Collection: EfficientFormer
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 80.46
+ Top 5 Accuracy: 94.99
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/efficientformer/efficientformer-l1_3rdparty_in1k_20220915-cc3e1ac6.pth
+ Config: configs/efficientformer/efficientformer-l1_8xb128_in1k.py
+ Converted From:
+ Weights: https://drive.google.com/file/d/11SbX-3cfqTOc247xKYubrAjBiUmr818y/view?usp=sharing
+ Code: https://github.com/snap-research/EfficientFormer
+ - Name: efficientformer-l3_3rdparty_8xb128_in1k
+ Metadata:
+ Training Data: ImageNet-1k
+ FLOPs: 3737045760 # 3.7G
+ Parameters: 31406000 # 31M
+ In Collection: EfficientFormer
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 82.45
+ Top 5 Accuracy: 96.18
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/efficientformer/efficientformer-l3_3rdparty_in1k_20220915-466793d6.pth
+ Config: configs/efficientformer/efficientformer-l3_8xb128_in1k.py
+ Converted From:
+ Weights: https://drive.google.com/file/d/1OyyjKKxDyMj-BcfInp4GlDdwLu3hc30m/view?usp=sharing
+ Code: https://github.com/snap-research/EfficientFormer
+ - Name: efficientformer-l7_3rdparty_8xb128_in1k
+ Metadata:
+ FLOPs: 10163951616 # 10.2G
+ Parameters: 82229328 # 82M
+ In Collection: EfficientFormer
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 83.40
+ Top 5 Accuracy: 96.60
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/efficientformer/efficientformer-l7_3rdparty_in1k_20220915-185e30af.pth
+ Config: configs/efficientformer/efficientformer-l7_8xb128_in1k.py
+ Converted From:
+ Weights: https://drive.google.com/file/d/1cVw-pctJwgvGafeouynqWWCwgkcoFMM5/view?usp=sharing
+ Code: https://github.com/snap-research/EfficientFormer
diff --git a/configs/efficientnet/README.md b/configs/efficientnet/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..c7b7b76ab5db29c3f9bc54eaefffdcf9cda4c13a
--- /dev/null
+++ b/configs/efficientnet/README.md
@@ -0,0 +1,122 @@
+# EfficientNet
+
+> [EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks](https://arxiv.org/abs/1905.11946v5)
+
+
+
+## Introduction
+
+EfficientNets are a family of image classification models that achieve state-of-the-art accuracy while being an order of magnitude smaller and faster than previous models.
+
+EfficientNets are based on AutoML and Compound Scaling. In particular, we first use the [AutoML MNAS Mobile framework](https://ai.googleblog.com/2018/08/mnasnet-towards-automating-design-of.html) to develop a mobile-size baseline network, named EfficientNet-B0; then, we use the compound scaling method to scale up this baseline to obtain EfficientNet-B1 through B7.
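+
+Concretely, compound scaling ties depth, width and input resolution to a single coefficient φ: depth scales as α^φ, width as β^φ and resolution as γ^φ, with α·β²·γ² ≈ 2 so that FLOPs roughly double per unit of φ (the paper's grid search gives α=1.2, β=1.1, γ=1.15 for the B0 baseline). The sketch below only illustrates this arithmetic; the released B1 to B7 models round these multipliers to hand-picked values, so it is not the reference implementation.
+
+```python
+# Illustrative compound-scaling arithmetic from the EfficientNet paper.
+alpha, beta, gamma = 1.2, 1.1, 1.15  # depth / width / resolution factors for B0
+
+def compound_scale(phi):
+    """Return (depth, width, resolution) multipliers for compound coefficient phi."""
+    return alpha ** phi, beta ** phi, gamma ** phi
+
+for phi in range(4):
+    d, w, r = compound_scale(phi)
+    print(f'phi={phi}: depth x{d:.2f}, width x{w:.2f}, resolution x{r:.2f}')
+```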
+
+
+
+
+
+## Abstract
+
+
+
+
+
+Convolutional Neural Networks (ConvNets) are commonly developed at a fixed resource budget, and then scaled up for better accuracy if more resources are available. In this paper, we systematically study model scaling and identify that carefully balancing network depth, width, and resolution can lead to better performance. Based on this observation, we propose a new scaling method that uniformly scales all dimensions of depth/width/resolution using a simple yet highly effective compound coefficient. We demonstrate the effectiveness of this method on scaling up MobileNets and ResNet. To go even further, we use neural architecture search to design a new baseline network and scale it up to obtain a family of models, called EfficientNets, which achieve much better accuracy and efficiency than previous ConvNets. In particular, our EfficientNet-B7 achieves state-of-the-art 84.3% top-1 accuracy on ImageNet, while being 8.4x smaller and 6.1x faster on inference than the best existing ConvNet. Our EfficientNets also transfer well and achieve state-of-the-art accuracy on CIFAR-100 (91.7%), Flowers (98.8%), and 3 other transfer learning datasets, with an order of magnitude fewer parameters.
+
+
+
+## How to use it?
+
+
+
+**Predict image**
+
+```python
+from mmpretrain import inference_model
+
+predict = inference_model('efficientnet-b0_3rdparty_8xb32_in1k', 'demo/bird.JPEG')
+print(predict['pred_class'])
+print(predict['pred_score'])
+```
+
+**Use the model**
+
+```python
+import torch
+from mmpretrain import get_model
+
+model = get_model('efficientnet-b0_3rdparty_8xb32_in1k', pretrained=True)
+inputs = torch.rand(1, 3, 224, 224)
+out = model(inputs)
+print(type(out))
+# To extract features.
+feats = model.extract_feat(inputs)
+print(type(feats))
+```
+
+**Test Command**
+
+Prepare your dataset according to the [docs](https://mmpretrain.readthedocs.io/en/latest/user_guides/dataset_prepare.html#prepare-dataset).
+
+Test:
+
+```shell
+python tools/test.py configs/efficientnet/efficientnet-b0_8xb32_in1k.py https://download.openmmlab.com/mmclassification/v0/efficientnet/efficientnet-b0_3rdparty_8xb32_in1k_20220119-a7e2a0b1.pth
+```
+
+
+
+## Models and results
+
+### Image Classification on ImageNet-1k
+
+| Model | Pretrain | Params (M) | Flops (G) | Top-1 (%) | Top-5 (%) | Config | Download |
+| :-------------------------------------------------- | :----------: | :--------: | :-------: | :-------: | :-------: | :--------------------------------------------: | :----------------------------------------------------: |
+| `efficientnet-b0_3rdparty_8xb32_in1k`\* | From scratch | 5.29 | 0.42 | 76.74 | 93.17 | [config](efficientnet-b0_8xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/efficientnet/efficientnet-b0_3rdparty_8xb32_in1k_20220119-a7e2a0b1.pth) |
+| `efficientnet-b0_3rdparty_8xb32-aa_in1k`\* | From scratch | 5.29 | 0.42 | 77.26 | 93.41 | [config](efficientnet-b0_8xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/efficientnet/efficientnet-b0_3rdparty_8xb32-aa_in1k_20220119-8d939117.pth) |
+| `efficientnet-b0_3rdparty_8xb32-aa-advprop_in1k`\* | From scratch | 5.29 | 0.42 | 77.53 | 93.61 | [config](efficientnet-b0_8xb32-01norm_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/efficientnet/efficientnet-b0_3rdparty_8xb32-aa-advprop_in1k_20220119-26434485.pth) |
+| `efficientnet-b0_3rdparty-ra-noisystudent_in1k`\* | From scratch | 5.29 | 0.42 | 77.63 | 94.00 | [config](efficientnet-b0_8xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/efficientnet/efficientnet-b0_3rdparty-ra-noisystudent_in1k_20221103-75cd08d3.pth) |
+| `efficientnet-b1_3rdparty_8xb32_in1k`\* | From scratch | 7.79 | 0.74 | 78.68 | 94.28 | [config](efficientnet-b1_8xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/efficientnet/efficientnet-b1_3rdparty_8xb32_in1k_20220119-002556d9.pth) |
+| `efficientnet-b1_3rdparty_8xb32-aa_in1k`\* | From scratch | 7.79 | 0.74 | 79.20 | 94.42 | [config](efficientnet-b1_8xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/efficientnet/efficientnet-b1_3rdparty_8xb32-aa_in1k_20220119-619d8ae3.pth) |
+| `efficientnet-b1_3rdparty_8xb32-aa-advprop_in1k`\* | From scratch | 7.79 | 0.74 | 79.52 | 94.43 | [config](efficientnet-b1_8xb32-01norm_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/efficientnet/efficientnet-b1_3rdparty_8xb32-aa-advprop_in1k_20220119-5715267d.pth) |
+| `efficientnet-b1_3rdparty-ra-noisystudent_in1k`\* | From scratch | 7.79 | 0.74 | 81.44 | 95.83 | [config](efficientnet-b1_8xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/efficientnet/efficientnet-b1_3rdparty-ra-noisystudent_in1k_20221103-756bcbc0.pth) |
+| `efficientnet-b2_3rdparty_8xb32_in1k`\* | From scratch | 9.11 | 1.07 | 79.64 | 94.80 | [config](efficientnet-b2_8xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/efficientnet/efficientnet-b2_3rdparty_8xb32_in1k_20220119-ea374a30.pth) |
+| `efficientnet-b2_3rdparty_8xb32-aa_in1k`\* | From scratch | 9.11 | 1.07 | 80.21 | 94.96 | [config](efficientnet-b2_8xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/efficientnet/efficientnet-b2_3rdparty_8xb32-aa_in1k_20220119-dd61e80b.pth) |
+| `efficientnet-b2_3rdparty_8xb32-aa-advprop_in1k`\* | From scratch | 9.11 | 1.07 | 80.45 | 95.07 | [config](efficientnet-b2_8xb32-01norm_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/efficientnet/efficientnet-b2_3rdparty_8xb32-aa-advprop_in1k_20220119-1655338a.pth) |
+| `efficientnet-b2_3rdparty-ra-noisystudent_in1k`\* | From scratch | 9.11 | 1.07 | 82.47 | 96.23 | [config](efficientnet-b2_8xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/efficientnet/efficientnet-b2_3rdparty-ra-noisystudent_in1k_20221103-301ed299.pth) |
+| `efficientnet-b3_3rdparty_8xb32_in1k`\* | From scratch | 12.23 | 1.95 | 81.01 | 95.34 | [config](efficientnet-b3_8xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/efficientnet/efficientnet-b3_3rdparty_8xb32_in1k_20220119-4b4d7487.pth) |
+| `efficientnet-b3_3rdparty_8xb32-aa_in1k`\* | From scratch | 12.23 | 1.95 | 81.58 | 95.67 | [config](efficientnet-b3_8xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/efficientnet/efficientnet-b3_3rdparty_8xb32-aa_in1k_20220119-5b4887a0.pth) |
+| `efficientnet-b3_3rdparty_8xb32-aa-advprop_in1k`\* | From scratch | 12.23 | 1.95 | 81.81 | 95.69 | [config](efficientnet-b3_8xb32-01norm_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/efficientnet/efficientnet-b3_3rdparty_8xb32-aa-advprop_in1k_20220119-53b41118.pth) |
+| `efficientnet-b3_3rdparty-ra-noisystudent_in1k`\* | From scratch | 12.23 | 1.95 | 84.02 | 96.89 | [config](efficientnet-b3_8xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/efficientnet/efficientnet-b3_3rdparty-ra-noisystudent_in1k_20221103-a4ab5fd6.pth) |
+| `efficientnet-b4_3rdparty_8xb32_in1k`\* | From scratch | 19.34 | 4.66 | 82.57 | 96.09 | [config](efficientnet-b4_8xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/efficientnet/efficientnet-b4_3rdparty_8xb32_in1k_20220119-81fd4077.pth) |
+| `efficientnet-b4_3rdparty_8xb32-aa_in1k`\* | From scratch | 19.34 | 4.66 | 82.95 | 96.26 | [config](efficientnet-b4_8xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/efficientnet/efficientnet-b4_3rdparty_8xb32-aa_in1k_20220119-45b8bd2b.pth) |
+| `efficientnet-b4_3rdparty_8xb32-aa-advprop_in1k`\* | From scratch | 19.34 | 4.66 | 83.25 | 96.44 | [config](efficientnet-b4_8xb32-01norm_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/efficientnet/efficientnet-b4_3rdparty_8xb32-aa-advprop_in1k_20220119-38c2238c.pth) |
+| `efficientnet-b4_3rdparty-ra-noisystudent_in1k`\* | From scratch | 19.34 | 4.66 | 85.25 | 97.52 | [config](efficientnet-b4_8xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/efficientnet/efficientnet-b4_3rdparty-ra-noisystudent_in1k_20221103-16ba8a2d.pth) |
+| `efficientnet-b5_3rdparty_8xb32_in1k`\* | From scratch | 30.39 | 10.80 | 83.18 | 96.47 | [config](efficientnet-b5_8xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/efficientnet/efficientnet-b5_3rdparty_8xb32_in1k_20220119-e9814430.pth) |
+| `efficientnet-b5_3rdparty_8xb32-aa_in1k`\* | From scratch | 30.39 | 10.80 | 83.82 | 96.76 | [config](efficientnet-b5_8xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/efficientnet/efficientnet-b5_3rdparty_8xb32-aa_in1k_20220119-2cab8b78.pth) |
+| `efficientnet-b5_3rdparty_8xb32-aa-advprop_in1k`\* | From scratch | 30.39 | 10.80 | 84.21 | 96.98 | [config](efficientnet-b5_8xb32-01norm_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/efficientnet/efficientnet-b5_3rdparty_8xb32-aa-advprop_in1k_20220119-f57a895a.pth) |
+| `efficientnet-b5_3rdparty-ra-noisystudent_in1k`\* | From scratch | 30.39 | 10.80 | 86.08 | 97.75 | [config](efficientnet-b5_8xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/efficientnet/efficientnet-b5_3rdparty-ra-noisystudent_in1k_20221103-111a185f.pth) |
+| `efficientnet-b6_3rdparty_8xb32-aa_in1k`\* | From scratch | 43.04 | 19.97 | 84.05 | 96.82 | [config](efficientnet-b6_8xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/efficientnet/efficientnet-b6_3rdparty_8xb32-aa_in1k_20220119-45b03310.pth) |
+| `efficientnet-b6_3rdparty_8xb32-aa-advprop_in1k`\* | From scratch | 43.04 | 19.97 | 84.74 | 97.14 | [config](efficientnet-b6_8xb32-01norm_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/efficientnet/efficientnet-b6_3rdparty_8xb32-aa-advprop_in1k_20220119-bfe3485e.pth) |
+| `efficientnet-b6_3rdparty-ra-noisystudent_in1k`\* | From scratch | 43.04 | 19.97 | 86.47 | 97.87 | [config](efficientnet-b6_8xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/efficientnet/efficientnet-b6_3rdparty-ra-noisystudent_in1k_20221103-7de7d2cc.pth) |
+| `efficientnet-b7_3rdparty_8xb32-aa_in1k`\* | From scratch | 66.35 | 39.32 | 84.38 | 96.88 | [config](efficientnet-b7_8xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/efficientnet/efficientnet-b7_3rdparty_8xb32-aa_in1k_20220119-bf03951c.pth) |
+| `efficientnet-b7_3rdparty_8xb32-aa-advprop_in1k`\* | From scratch | 66.35 | 39.32 | 85.14 | 97.23 | [config](efficientnet-b7_8xb32-01norm_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/efficientnet/efficientnet-b7_3rdparty_8xb32-aa-advprop_in1k_20220119-c6dbff10.pth) |
+| `efficientnet-b7_3rdparty-ra-noisystudent_in1k`\* | From scratch | 66.35 | 39.32 | 86.83 | 98.08 | [config](efficientnet-b7_8xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/efficientnet/efficientnet-b7_3rdparty-ra-noisystudent_in1k_20221103-a82894bc.pth) |
+| `efficientnet-b8_3rdparty_8xb32-aa-advprop_in1k`\* | From scratch | 87.41 | 65.00 | 85.38 | 97.28 | [config](efficientnet-b8_8xb32-01norm_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/efficientnet/efficientnet-b8_3rdparty_8xb32-aa-advprop_in1k_20220119-297ce1b7.pth) |
+| `efficientnet-l2_3rdparty-ra-noisystudent_in1k-800px`\* | From scratch | 480.31 | 174.20 | 88.33 | 98.65 | [config](efficientnet-l2_8xb8_in1k-800px.py) | [model](https://download.openmmlab.com/mmclassification/v0/efficientnet/efficientnet-l2_3rdparty-ra-noisystudent_in1k_20221103-be73be13.pth) |
+| `efficientnet-l2_3rdparty-ra-noisystudent_in1k-475px`\* | From scratch | 480.31 | 484.98 | 88.18 | 98.55 | [config](efficientnet-l2_8xb32_in1k-475px.py) | [model](https://download.openmmlab.com/mmclassification/v0/efficientnet/efficientnet-l2_3rdparty-ra-noisystudent_in1k-475px_20221103-5a0d8058.pth) |
+
+*Models with * are converted from the [official repo](https://github.com/tensorflow/tpu/tree/master/models/official/efficientnet). The config files of these models are only for inference. We haven't reproduced the training results.*
+
+## Citation
+
+```bibtex
+@inproceedings{tan2019efficientnet,
+ title={Efficientnet: Rethinking model scaling for convolutional neural networks},
+ author={Tan, Mingxing and Le, Quoc},
+ booktitle={International Conference on Machine Learning},
+ pages={6105--6114},
+ year={2019},
+ organization={PMLR}
+}
+```
diff --git a/configs/efficientnet/efficientnet-b0_8xb32-01norm_in1k.py b/configs/efficientnet/efficientnet-b0_8xb32-01norm_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..369d0a43d1950de5da47789d0f28465c95fdaae5
--- /dev/null
+++ b/configs/efficientnet/efficientnet-b0_8xb32-01norm_in1k.py
@@ -0,0 +1,31 @@
+_base_ = [
+ '../_base_/models/efficientnet_b0.py',
+ '../_base_/datasets/imagenet_bs32.py',
+ '../_base_/schedules/imagenet_bs256.py',
+ '../_base_/default_runtime.py',
+]
+
+# dataset settings
+data_preprocessor = dict(
+ mean=[127.5, 127.5, 127.5],
+ std=[127.5, 127.5, 127.5],
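+ # i.e. map pixel values from [0, 255] to roughly [-1, 1] instead of using
+ # the default ImageNet mean/std statistics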
+ # convert image from BGR to RGB
+ to_rgb=True,
+)
+
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='EfficientNetRandomCrop', scale=224),
+ dict(type='RandomFlip', prob=0.5, direction='horizontal'),
+ dict(type='PackInputs'),
+]
+
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='EfficientNetCenterCrop', crop_size=224),
+ dict(type='PackInputs'),
+]
+
+train_dataloader = dict(dataset=dict(pipeline=train_pipeline))
+val_dataloader = dict(dataset=dict(pipeline=test_pipeline))
+test_dataloader = dict(dataset=dict(pipeline=test_pipeline))
diff --git a/configs/efficientnet/efficientnet-b0_8xb32_in1k.py b/configs/efficientnet/efficientnet-b0_8xb32_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..e4263da196430b310fae4da3273d13bb66e89075
--- /dev/null
+++ b/configs/efficientnet/efficientnet-b0_8xb32_in1k.py
@@ -0,0 +1,24 @@
+_base_ = [
+ '../_base_/models/efficientnet_b0.py',
+ '../_base_/datasets/imagenet_bs32.py',
+ '../_base_/schedules/imagenet_bs256.py',
+ '../_base_/default_runtime.py',
+]
+
+# dataset settings
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='EfficientNetRandomCrop', scale=224),
+ dict(type='RandomFlip', prob=0.5, direction='horizontal'),
+ dict(type='PackInputs'),
+]
+
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='EfficientNetCenterCrop', crop_size=224),
+ dict(type='PackInputs'),
+]
+
+train_dataloader = dict(dataset=dict(pipeline=train_pipeline))
+val_dataloader = dict(dataset=dict(pipeline=test_pipeline))
+test_dataloader = dict(dataset=dict(pipeline=test_pipeline))
diff --git a/configs/efficientnet/efficientnet-b1_8xb32-01norm_in1k.py b/configs/efficientnet/efficientnet-b1_8xb32-01norm_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..0405cf5f84eeedf0a2e761670bc600d9f82401af
--- /dev/null
+++ b/configs/efficientnet/efficientnet-b1_8xb32-01norm_in1k.py
@@ -0,0 +1,31 @@
+_base_ = [
+ '../_base_/models/efficientnet_b1.py',
+ '../_base_/datasets/imagenet_bs32.py',
+ '../_base_/schedules/imagenet_bs256.py',
+ '../_base_/default_runtime.py',
+]
+
+# dataset settings
+data_preprocessor = dict(
+ mean=[127.5, 127.5, 127.5],
+ std=[127.5, 127.5, 127.5],
+ # convert image from BGR to RGB
+ to_rgb=True,
+)
+
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='EfficientNetRandomCrop', scale=240),
+ dict(type='RandomFlip', prob=0.5, direction='horizontal'),
+ dict(type='PackInputs'),
+]
+
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='EfficientNetCenterCrop', crop_size=240),
+ dict(type='PackInputs'),
+]
+
+train_dataloader = dict(dataset=dict(pipeline=train_pipeline))
+val_dataloader = dict(dataset=dict(pipeline=test_pipeline))
+test_dataloader = dict(dataset=dict(pipeline=test_pipeline))
diff --git a/configs/efficientnet/efficientnet-b1_8xb32_in1k.py b/configs/efficientnet/efficientnet-b1_8xb32_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..e5bf2e8076d81c97adb4d1883cfbdb5f645b6b93
--- /dev/null
+++ b/configs/efficientnet/efficientnet-b1_8xb32_in1k.py
@@ -0,0 +1,24 @@
+_base_ = [
+ '../_base_/models/efficientnet_b1.py',
+ '../_base_/datasets/imagenet_bs32.py',
+ '../_base_/schedules/imagenet_bs256.py',
+ '../_base_/default_runtime.py',
+]
+
+# dataset settings
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='EfficientNetRandomCrop', scale=240),
+ dict(type='RandomFlip', prob=0.5, direction='horizontal'),
+ dict(type='PackInputs'),
+]
+
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='EfficientNetCenterCrop', crop_size=240),
+ dict(type='PackInputs'),
+]
+
+train_dataloader = dict(dataset=dict(pipeline=train_pipeline))
+val_dataloader = dict(dataset=dict(pipeline=test_pipeline))
+test_dataloader = dict(dataset=dict(pipeline=test_pipeline))
diff --git a/configs/efficientnet/efficientnet-b2_8xb32-01norm_in1k.py b/configs/efficientnet/efficientnet-b2_8xb32-01norm_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..da3f23b84c6f7fc8b5d415b90ca2f69f4d6e58c4
--- /dev/null
+++ b/configs/efficientnet/efficientnet-b2_8xb32-01norm_in1k.py
@@ -0,0 +1,31 @@
+_base_ = [
+ '../_base_/models/efficientnet_b2.py',
+ '../_base_/datasets/imagenet_bs32.py',
+ '../_base_/schedules/imagenet_bs256.py',
+ '../_base_/default_runtime.py',
+]
+
+# dataset settings
+data_preprocessor = dict(
+ mean=[127.5, 127.5, 127.5],
+ std=[127.5, 127.5, 127.5],
+ # convert image from BGR to RGB
+ to_rgb=True,
+)
+
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='EfficientNetRandomCrop', scale=260),
+ dict(type='RandomFlip', prob=0.5, direction='horizontal'),
+ dict(type='PackInputs'),
+]
+
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='EfficientNetCenterCrop', crop_size=260),
+ dict(type='PackInputs'),
+]
+
+train_dataloader = dict(dataset=dict(pipeline=train_pipeline))
+val_dataloader = dict(dataset=dict(pipeline=test_pipeline))
+test_dataloader = dict(dataset=dict(pipeline=test_pipeline))
diff --git a/configs/efficientnet/efficientnet-b2_8xb32_in1k.py b/configs/efficientnet/efficientnet-b2_8xb32_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..060a2ad3ea9247131c4207d738dce0bfacd16a16
--- /dev/null
+++ b/configs/efficientnet/efficientnet-b2_8xb32_in1k.py
@@ -0,0 +1,24 @@
+_base_ = [
+ '../_base_/models/efficientnet_b2.py',
+ '../_base_/datasets/imagenet_bs32.py',
+ '../_base_/schedules/imagenet_bs256.py',
+ '../_base_/default_runtime.py',
+]
+
+# dataset settings
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='EfficientNetRandomCrop', scale=260),
+ dict(type='RandomFlip', prob=0.5, direction='horizontal'),
+ dict(type='PackInputs'),
+]
+
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='EfficientNetCenterCrop', crop_size=260),
+ dict(type='PackInputs'),
+]
+
+train_dataloader = dict(dataset=dict(pipeline=train_pipeline))
+val_dataloader = dict(dataset=dict(pipeline=test_pipeline))
+test_dataloader = dict(dataset=dict(pipeline=test_pipeline))
diff --git a/configs/efficientnet/efficientnet-b3_8xb32-01norm_in1k.py b/configs/efficientnet/efficientnet-b3_8xb32-01norm_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..55729a9c2258352a6ed981dff25777b0acaaae85
--- /dev/null
+++ b/configs/efficientnet/efficientnet-b3_8xb32-01norm_in1k.py
@@ -0,0 +1,31 @@
+_base_ = [
+ '../_base_/models/efficientnet_b3.py',
+ '../_base_/datasets/imagenet_bs32.py',
+ '../_base_/schedules/imagenet_bs256.py',
+ '../_base_/default_runtime.py',
+]
+
+# dataset settings
+data_preprocessor = dict(
+ mean=[127.5, 127.5, 127.5],
+ std=[127.5, 127.5, 127.5],
+ # convert image from BGR to RGB
+ to_rgb=True,
+)
+
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='EfficientNetRandomCrop', scale=300),
+ dict(type='RandomFlip', prob=0.5, direction='horizontal'),
+ dict(type='PackInputs'),
+]
+
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='EfficientNetCenterCrop', crop_size=300),
+ dict(type='PackInputs'),
+]
+
+train_dataloader = dict(dataset=dict(pipeline=train_pipeline))
+val_dataloader = dict(dataset=dict(pipeline=test_pipeline))
+test_dataloader = dict(dataset=dict(pipeline=test_pipeline))
diff --git a/configs/efficientnet/efficientnet-b3_8xb32_in1k.py b/configs/efficientnet/efficientnet-b3_8xb32_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..d84de5a79316ab6d7f73e45f266fbaec43ed9629
--- /dev/null
+++ b/configs/efficientnet/efficientnet-b3_8xb32_in1k.py
@@ -0,0 +1,24 @@
+_base_ = [
+ '../_base_/models/efficientnet_b3.py',
+ '../_base_/datasets/imagenet_bs32.py',
+ '../_base_/schedules/imagenet_bs256.py',
+ '../_base_/default_runtime.py',
+]
+
+# dataset settings
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='EfficientNetRandomCrop', scale=300),
+ dict(type='RandomFlip', prob=0.5, direction='horizontal'),
+ dict(type='PackInputs'),
+]
+
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='EfficientNetCenterCrop', crop_size=300),
+ dict(type='PackInputs'),
+]
+
+train_dataloader = dict(dataset=dict(pipeline=train_pipeline))
+val_dataloader = dict(dataset=dict(pipeline=test_pipeline))
+test_dataloader = dict(dataset=dict(pipeline=test_pipeline))
diff --git a/configs/efficientnet/efficientnet-b4_8xb32-01norm_in1k.py b/configs/efficientnet/efficientnet-b4_8xb32-01norm_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..a4dbfb212fd03d508b678a684f4d8b6854f648c6
--- /dev/null
+++ b/configs/efficientnet/efficientnet-b4_8xb32-01norm_in1k.py
@@ -0,0 +1,31 @@
+_base_ = [
+ '../_base_/models/efficientnet_b4.py',
+ '../_base_/datasets/imagenet_bs32.py',
+ '../_base_/schedules/imagenet_bs256.py',
+ '../_base_/default_runtime.py',
+]
+
+# dataset settings
+data_preprocessor = dict(
+ mean=[127.5, 127.5, 127.5],
+ std=[127.5, 127.5, 127.5],
+ # convert image from BGR to RGB
+ to_rgb=True,
+)
+
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='EfficientNetRandomCrop', scale=380),
+ dict(type='RandomFlip', prob=0.5, direction='horizontal'),
+ dict(type='PackInputs'),
+]
+
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='EfficientNetCenterCrop', crop_size=380),
+ dict(type='PackInputs'),
+]
+
+train_dataloader = dict(dataset=dict(pipeline=train_pipeline))
+val_dataloader = dict(dataset=dict(pipeline=test_pipeline))
+test_dataloader = dict(dataset=dict(pipeline=test_pipeline))
diff --git a/configs/efficientnet/efficientnet-b4_8xb32_in1k.py b/configs/efficientnet/efficientnet-b4_8xb32_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..08e246c3851d12ee067469d9afb10fc7f0933de7
--- /dev/null
+++ b/configs/efficientnet/efficientnet-b4_8xb32_in1k.py
@@ -0,0 +1,24 @@
+_base_ = [
+ '../_base_/models/efficientnet_b4.py',
+ '../_base_/datasets/imagenet_bs32.py',
+ '../_base_/schedules/imagenet_bs256.py',
+ '../_base_/default_runtime.py',
+]
+
+# dataset settings
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='EfficientNetRandomCrop', scale=380),
+ dict(type='RandomFlip', prob=0.5, direction='horizontal'),
+ dict(type='PackInputs'),
+]
+
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='EfficientNetCenterCrop', crop_size=380),
+ dict(type='PackInputs'),
+]
+
+train_dataloader = dict(dataset=dict(pipeline=train_pipeline))
+val_dataloader = dict(dataset=dict(pipeline=test_pipeline))
+test_dataloader = dict(dataset=dict(pipeline=test_pipeline))
diff --git a/configs/efficientnet/efficientnet-b5_8xb32-01norm_in1k.py b/configs/efficientnet/efficientnet-b5_8xb32-01norm_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..0c646da43d4baf23cebfc6835ec400dba6d5bd35
--- /dev/null
+++ b/configs/efficientnet/efficientnet-b5_8xb32-01norm_in1k.py
@@ -0,0 +1,31 @@
+_base_ = [
+ '../_base_/models/efficientnet_b5.py',
+ '../_base_/datasets/imagenet_bs32.py',
+ '../_base_/schedules/imagenet_bs256.py',
+ '../_base_/default_runtime.py',
+]
+
+# dataset settings
+data_preprocessor = dict(
+ mean=[127.5, 127.5, 127.5],
+ std=[127.5, 127.5, 127.5],
+ # convert image from BGR to RGB
+ to_rgb=True,
+)
+
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='EfficientNetRandomCrop', scale=456),
+ dict(type='RandomFlip', prob=0.5, direction='horizontal'),
+ dict(type='PackInputs'),
+]
+
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='EfficientNetCenterCrop', crop_size=456),
+ dict(type='PackInputs'),
+]
+
+train_dataloader = dict(dataset=dict(pipeline=train_pipeline))
+val_dataloader = dict(dataset=dict(pipeline=test_pipeline))
+test_dataloader = dict(dataset=dict(pipeline=test_pipeline))
diff --git a/configs/efficientnet/efficientnet-b5_8xb32_in1k.py b/configs/efficientnet/efficientnet-b5_8xb32_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..af4fa4b8fcbce99ae1ac163c72cec11789109482
--- /dev/null
+++ b/configs/efficientnet/efficientnet-b5_8xb32_in1k.py
@@ -0,0 +1,24 @@
+_base_ = [
+ '../_base_/models/efficientnet_b5.py',
+ '../_base_/datasets/imagenet_bs32.py',
+ '../_base_/schedules/imagenet_bs256.py',
+ '../_base_/default_runtime.py',
+]
+
+# dataset settings
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='EfficientNetRandomCrop', scale=456),
+ dict(type='RandomFlip', prob=0.5, direction='horizontal'),
+ dict(type='PackInputs'),
+]
+
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='EfficientNetCenterCrop', crop_size=456),
+ dict(type='PackInputs'),
+]
+
+train_dataloader = dict(dataset=dict(pipeline=train_pipeline))
+val_dataloader = dict(dataset=dict(pipeline=test_pipeline))
+test_dataloader = dict(dataset=dict(pipeline=test_pipeline))
diff --git a/configs/efficientnet/efficientnet-b6_8xb32-01norm_in1k.py b/configs/efficientnet/efficientnet-b6_8xb32-01norm_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..dd15054928b56bdae2c3a2ef479e96826824fe2b
--- /dev/null
+++ b/configs/efficientnet/efficientnet-b6_8xb32-01norm_in1k.py
@@ -0,0 +1,31 @@
+_base_ = [
+ '../_base_/models/efficientnet_b6.py',
+ '../_base_/datasets/imagenet_bs32.py',
+ '../_base_/schedules/imagenet_bs256.py',
+ '../_base_/default_runtime.py',
+]
+
+# dataset settings
+data_preprocessor = dict(
+ mean=[127.5, 127.5, 127.5],
+ std=[127.5, 127.5, 127.5],
+ # convert image from BGR to RGB
+ to_rgb=True,
+)
+
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='EfficientNetRandomCrop', scale=528),
+ dict(type='RandomFlip', prob=0.5, direction='horizontal'),
+ dict(type='PackInputs'),
+]
+
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='EfficientNetCenterCrop', crop_size=528),
+ dict(type='PackInputs'),
+]
+
+train_dataloader = dict(dataset=dict(pipeline=train_pipeline))
+val_dataloader = dict(dataset=dict(pipeline=test_pipeline))
+test_dataloader = dict(dataset=dict(pipeline=test_pipeline))
diff --git a/configs/efficientnet/efficientnet-b6_8xb32_in1k.py b/configs/efficientnet/efficientnet-b6_8xb32_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..fae02aed6dd5b8fbb1b42140856333b771c927d1
--- /dev/null
+++ b/configs/efficientnet/efficientnet-b6_8xb32_in1k.py
@@ -0,0 +1,24 @@
+_base_ = [
+ '../_base_/models/efficientnet_b6.py',
+ '../_base_/datasets/imagenet_bs32.py',
+ '../_base_/schedules/imagenet_bs256.py',
+ '../_base_/default_runtime.py',
+]
+
+# dataset settings
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='EfficientNetRandomCrop', scale=528),
+ dict(type='RandomFlip', prob=0.5, direction='horizontal'),
+ dict(type='PackInputs'),
+]
+
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='EfficientNetCenterCrop', crop_size=528),
+ dict(type='PackInputs'),
+]
+
+train_dataloader = dict(dataset=dict(pipeline=train_pipeline))
+val_dataloader = dict(dataset=dict(pipeline=test_pipeline))
+test_dataloader = dict(dataset=dict(pipeline=test_pipeline))
diff --git a/configs/efficientnet/efficientnet-b7_8xb32-01norm_in1k.py b/configs/efficientnet/efficientnet-b7_8xb32-01norm_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..687dfd261d73d84061b289c955cb0260059999b2
--- /dev/null
+++ b/configs/efficientnet/efficientnet-b7_8xb32-01norm_in1k.py
@@ -0,0 +1,31 @@
+_base_ = [
+ '../_base_/models/efficientnet_b7.py',
+ '../_base_/datasets/imagenet_bs32.py',
+ '../_base_/schedules/imagenet_bs256.py',
+ '../_base_/default_runtime.py',
+]
+
+# dataset settings
+data_preprocessor = dict(
+ mean=[127.5, 127.5, 127.5],
+ std=[127.5, 127.5, 127.5],
+ # convert image from BGR to RGB
+ to_rgb=True,
+)
+
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='EfficientNetRandomCrop', scale=600),
+ dict(type='RandomFlip', prob=0.5, direction='horizontal'),
+ dict(type='PackInputs'),
+]
+
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='EfficientNetCenterCrop', crop_size=600),
+ dict(type='PackInputs'),
+]
+
+train_dataloader = dict(dataset=dict(pipeline=train_pipeline))
+val_dataloader = dict(dataset=dict(pipeline=test_pipeline))
+test_dataloader = dict(dataset=dict(pipeline=test_pipeline))
diff --git a/configs/efficientnet/efficientnet-b7_8xb32_in1k.py b/configs/efficientnet/efficientnet-b7_8xb32_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..5d783bb30bf1939aa1c8c9a010e5733ae7b1342b
--- /dev/null
+++ b/configs/efficientnet/efficientnet-b7_8xb32_in1k.py
@@ -0,0 +1,24 @@
+_base_ = [
+ '../_base_/models/efficientnet_b7.py',
+ '../_base_/datasets/imagenet_bs32.py',
+ '../_base_/schedules/imagenet_bs256.py',
+ '../_base_/default_runtime.py',
+]
+
+# dataset settings
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='EfficientNetRandomCrop', scale=600),
+ dict(type='RandomFlip', prob=0.5, direction='horizontal'),
+ dict(type='PackInputs'),
+]
+
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='EfficientNetCenterCrop', crop_size=600),
+ dict(type='PackInputs'),
+]
+
+train_dataloader = dict(dataset=dict(pipeline=train_pipeline))
+val_dataloader = dict(dataset=dict(pipeline=test_pipeline))
+test_dataloader = dict(dataset=dict(pipeline=test_pipeline))
diff --git a/configs/efficientnet/efficientnet-b8_8xb32-01norm_in1k.py b/configs/efficientnet/efficientnet-b8_8xb32-01norm_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..07d3692baa9b9f3d10109e63d1da5e74cc62ee26
--- /dev/null
+++ b/configs/efficientnet/efficientnet-b8_8xb32-01norm_in1k.py
@@ -0,0 +1,31 @@
+_base_ = [
+ '../_base_/models/efficientnet_b8.py',
+ '../_base_/datasets/imagenet_bs32.py',
+ '../_base_/schedules/imagenet_bs256.py',
+ '../_base_/default_runtime.py',
+]
+
+# dataset settings
+data_preprocessor = dict(
+ mean=[127.5, 127.5, 127.5],
+ std=[127.5, 127.5, 127.5],
+ # convert image from BGR to RGB
+ to_rgb=True,
+)
+
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='EfficientNetRandomCrop', scale=672),
+ dict(type='RandomFlip', prob=0.5, direction='horizontal'),
+ dict(type='PackInputs'),
+]
+
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='EfficientNetCenterCrop', crop_size=672),
+ dict(type='PackInputs'),
+]
+
+train_dataloader = dict(dataset=dict(pipeline=train_pipeline))
+val_dataloader = dict(dataset=dict(pipeline=test_pipeline))
+test_dataloader = dict(dataset=dict(pipeline=test_pipeline))
diff --git a/configs/efficientnet/efficientnet-b8_8xb32_in1k.py b/configs/efficientnet/efficientnet-b8_8xb32_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..868986f52488233b36631c13d66d8da2aac8c348
--- /dev/null
+++ b/configs/efficientnet/efficientnet-b8_8xb32_in1k.py
@@ -0,0 +1,24 @@
+_base_ = [
+ '../_base_/models/efficientnet_b8.py',
+ '../_base_/datasets/imagenet_bs32.py',
+ '../_base_/schedules/imagenet_bs256.py',
+ '../_base_/default_runtime.py',
+]
+
+# dataset settings
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='EfficientNetRandomCrop', scale=672),
+ dict(type='RandomFlip', prob=0.5, direction='horizontal'),
+ dict(type='PackInputs'),
+]
+
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='EfficientNetCenterCrop', crop_size=672),
+ dict(type='PackInputs'),
+]
+
+train_dataloader = dict(dataset=dict(pipeline=train_pipeline))
+val_dataloader = dict(dataset=dict(pipeline=test_pipeline))
+test_dataloader = dict(dataset=dict(pipeline=test_pipeline))
diff --git a/configs/efficientnet/efficientnet-em_8xb32-01norm_in1k.py b/configs/efficientnet/efficientnet-em_8xb32-01norm_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..9de3b27fb31a1382c08a646987b7cf4d996e77f4
--- /dev/null
+++ b/configs/efficientnet/efficientnet-em_8xb32-01norm_in1k.py
@@ -0,0 +1,31 @@
+_base_ = [
+ '../_base_/models/efficientnet_em.py',
+ '../_base_/datasets/imagenet_bs32.py',
+ '../_base_/schedules/imagenet_bs256.py',
+ '../_base_/default_runtime.py',
+]
+
+# dataset settings
+data_preprocessor = dict(
+ mean=[127.5, 127.5, 127.5],
+ std=[127.5, 127.5, 127.5],
+ # convert image from BGR to RGB
+ to_rgb=True,
+)
+
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='EfficientNetRandomCrop', scale=240),
+ dict(type='RandomFlip', prob=0.5, direction='horizontal'),
+ dict(type='PackInputs'),
+]
+
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='EfficientNetCenterCrop', crop_size=240),
+ dict(type='PackInputs'),
+]
+
+train_dataloader = dict(dataset=dict(pipeline=train_pipeline))
+val_dataloader = dict(dataset=dict(pipeline=test_pipeline))
+test_dataloader = dict(dataset=dict(pipeline=test_pipeline))
diff --git a/configs/efficientnet/efficientnet-es_8xb32-01norm_in1k.py b/configs/efficientnet/efficientnet-es_8xb32-01norm_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..e643d55089b932732d47c5dbe5734c2085a2fb3e
--- /dev/null
+++ b/configs/efficientnet/efficientnet-es_8xb32-01norm_in1k.py
@@ -0,0 +1,24 @@
+_base_ = [
+ '../_base_/models/efficientnet_es.py',
+ '../_base_/datasets/imagenet_bs32.py',
+ '../_base_/schedules/imagenet_bs256.py',
+ '../_base_/default_runtime.py',
+]
+
+# dataset settings
+data_preprocessor = dict(
+    mean=[127.5, 127.5, 127.5],
+    std=[127.5, 127.5, 127.5],
+    # convert image from BGR to RGB
+    to_rgb=True,
+)
+
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='EfficientNetRandomCrop', scale=224),
+ dict(type='RandomFlip', prob=0.5, direction='horizontal'),
+ dict(type='PackInputs'),
+]
+
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='EfficientNetCenterCrop', crop_size=224),
+ dict(type='PackInputs'),
+]
+
+train_dataloader = dict(dataset=dict(pipeline=train_pipeline))
+val_dataloader = dict(dataset=dict(pipeline=test_pipeline))
+test_dataloader = dict(dataset=dict(pipeline=test_pipeline))
diff --git a/configs/efficientnet/efficientnet-l2_8xb32_in1k-475px.py b/configs/efficientnet/efficientnet-l2_8xb32_in1k-475px.py
new file mode 100644
index 0000000000000000000000000000000000000000..560695144f50194c00bc78707c8ddf7288e4cd52
--- /dev/null
+++ b/configs/efficientnet/efficientnet-l2_8xb32_in1k-475px.py
@@ -0,0 +1,24 @@
+_base_ = [
+ '../_base_/models/efficientnet_l2.py',
+ '../_base_/datasets/imagenet_bs32.py',
+ '../_base_/schedules/imagenet_bs256.py',
+ '../_base_/default_runtime.py',
+]
+
+# dataset settings
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='EfficientNetRandomCrop', scale=475),
+ dict(type='RandomFlip', prob=0.5, direction='horizontal'),
+ dict(type='PackInputs'),
+]
+
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='EfficientNetCenterCrop', crop_size=475),
+ dict(type='PackInputs'),
+]
+
+train_dataloader = dict(dataset=dict(pipeline=train_pipeline))
+val_dataloader = dict(dataset=dict(pipeline=test_pipeline))
+test_dataloader = dict(dataset=dict(pipeline=test_pipeline))
diff --git a/configs/efficientnet/efficientnet-l2_8xb8_in1k-800px.py b/configs/efficientnet/efficientnet-l2_8xb8_in1k-800px.py
new file mode 100644
index 0000000000000000000000000000000000000000..61bddfa735117db68377a224f72c1160a046ae1c
--- /dev/null
+++ b/configs/efficientnet/efficientnet-l2_8xb8_in1k-800px.py
@@ -0,0 +1,24 @@
+_base_ = [
+ '../_base_/models/efficientnet_l2.py',
+ '../_base_/datasets/imagenet_bs32.py',
+ '../_base_/schedules/imagenet_bs256.py',
+ '../_base_/default_runtime.py',
+]
+
+# dataset settings
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='EfficientNetRandomCrop', scale=800),
+ dict(type='RandomFlip', prob=0.5, direction='horizontal'),
+ dict(type='PackInputs'),
+]
+
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='EfficientNetCenterCrop', crop_size=800),
+ dict(type='PackInputs'),
+]
+
+train_dataloader = dict(batch_size=8, dataset=dict(pipeline=train_pipeline))
+val_dataloader = dict(batch_size=8, dataset=dict(pipeline=test_pipeline))
+test_dataloader = dict(batch_size=8, dataset=dict(pipeline=test_pipeline))
diff --git a/configs/efficientnet/metafile.yml b/configs/efficientnet/metafile.yml
new file mode 100644
index 0000000000000000000000000000000000000000..21130c4ff1d64895295372acac18961a4f90bd7c
--- /dev/null
+++ b/configs/efficientnet/metafile.yml
@@ -0,0 +1,551 @@
+Collections:
+ - Name: EfficientNet
+ Metadata:
+ Training Data: ImageNet-1k
+ Architecture:
+ - 1x1 Convolution
+ - Average Pooling
+ - Convolution
+ - Dense Connections
+ - Dropout
+ - Inverted Residual Block
+ - RMSProp
+ - Squeeze-and-Excitation Block
+ - Swish
+ Paper:
+ URL: https://arxiv.org/abs/1905.11946v5
+ Title: "EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks"
+ README: configs/efficientnet/README.md
+ Code:
+ Version: v0.20.1
+ URL: https://github.com/open-mmlab/mmpretrain/blob/v0.20.1/mmcls/models/backbones/efficientnet.py
+
+Models:
+ - Name: efficientnet-b0_3rdparty_8xb32_in1k
+ Metadata:
+ FLOPs: 420592480
+ Parameters: 5288548
+ In Collection: EfficientNet
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 76.74
+ Top 5 Accuracy: 93.17
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/efficientnet/efficientnet-b0_3rdparty_8xb32_in1k_20220119-a7e2a0b1.pth
+ Config: configs/efficientnet/efficientnet-b0_8xb32_in1k.py
+ Converted From:
+ Weights: https://storage.googleapis.com/cloud-tpu-checkpoints/efficientnet/ckpts/efficientnet-b0.tar.gz
+ Code: https://github.com/tensorflow/tpu/tree/master/models/official/efficientnet
+ - Name: efficientnet-b0_3rdparty_8xb32-aa_in1k
+ Metadata:
+ FLOPs: 420592480
+ Parameters: 5288548
+ In Collection: EfficientNet
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 77.26
+ Top 5 Accuracy: 93.41
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/efficientnet/efficientnet-b0_3rdparty_8xb32-aa_in1k_20220119-8d939117.pth
+ Config: configs/efficientnet/efficientnet-b0_8xb32_in1k.py
+ Converted From:
+ Weights: https://storage.googleapis.com/cloud-tpu-checkpoints/efficientnet/ckptsaug/efficientnet-b0.tar.gz
+ Code: https://github.com/tensorflow/tpu/tree/master/models/official/efficientnet
+ - Name: efficientnet-b0_3rdparty_8xb32-aa-advprop_in1k
+ Metadata:
+ FLOPs: 420592480
+ Parameters: 5288548
+ In Collection: EfficientNet
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 77.53
+ Top 5 Accuracy: 93.61
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/efficientnet/efficientnet-b0_3rdparty_8xb32-aa-advprop_in1k_20220119-26434485.pth
+ Config: configs/efficientnet/efficientnet-b0_8xb32-01norm_in1k.py
+ Converted From:
+ Weights: https://storage.googleapis.com/cloud-tpu-checkpoints/efficientnet/advprop/efficientnet-b0.tar.gz
+ Code: https://github.com/tensorflow/tpu/tree/master/models/official/efficientnet
+ - Name: efficientnet-b0_3rdparty-ra-noisystudent_in1k
+ Metadata:
+ FLOPs: 420592480
+ Parameters: 5288548
+ In Collection: EfficientNet
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 77.63
+ Top 5 Accuracy: 94.00
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/efficientnet/efficientnet-b0_3rdparty-ra-noisystudent_in1k_20221103-75cd08d3.pth
+ Config: configs/efficientnet/efficientnet-b0_8xb32_in1k.py
+ Converted From:
+ Weights: https://storage.googleapis.com/cloud-tpu-checkpoints/efficientnet/noisystudent/noisy_student_efficientnet-b0.tar.gz
+ Code: https://github.com/tensorflow/tpu/tree/master/models/official/efficientnet
+ - Name: efficientnet-b1_3rdparty_8xb32_in1k
+ Metadata:
+ FLOPs: 744059920
+ Parameters: 7794184
+ In Collection: EfficientNet
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 78.68
+ Top 5 Accuracy: 94.28
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/efficientnet/efficientnet-b1_3rdparty_8xb32_in1k_20220119-002556d9.pth
+ Config: configs/efficientnet/efficientnet-b1_8xb32_in1k.py
+ Converted From:
+ Weights: https://storage.googleapis.com/cloud-tpu-checkpoints/efficientnet/ckpts/efficientnet-b1.tar.gz
+ Code: https://github.com/tensorflow/tpu/tree/master/models/official/efficientnet
+ - Name: efficientnet-b1_3rdparty_8xb32-aa_in1k
+ Metadata:
+ FLOPs: 744059920
+ Parameters: 7794184
+ In Collection: EfficientNet
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 79.20
+ Top 5 Accuracy: 94.42
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/efficientnet/efficientnet-b1_3rdparty_8xb32-aa_in1k_20220119-619d8ae3.pth
+ Config: configs/efficientnet/efficientnet-b1_8xb32_in1k.py
+ Converted From:
+ Weights: https://storage.googleapis.com/cloud-tpu-checkpoints/efficientnet/ckptsaug/efficientnet-b1.tar.gz
+ Code: https://github.com/tensorflow/tpu/tree/master/models/official/efficientnet
+ - Name: efficientnet-b1_3rdparty_8xb32-aa-advprop_in1k
+ Metadata:
+ FLOPs: 744059920
+ Parameters: 7794184
+ In Collection: EfficientNet
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 79.52
+ Top 5 Accuracy: 94.43
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/efficientnet/efficientnet-b1_3rdparty_8xb32-aa-advprop_in1k_20220119-5715267d.pth
+ Config: configs/efficientnet/efficientnet-b1_8xb32-01norm_in1k.py
+ Converted From:
+ Weights: https://storage.googleapis.com/cloud-tpu-checkpoints/efficientnet/advprop/efficientnet-b1.tar.gz
+ Code: https://github.com/tensorflow/tpu/tree/master/models/official/efficientnet
+ - Name: efficientnet-b1_3rdparty-ra-noisystudent_in1k
+ Metadata:
+ FLOPs: 744059920
+ Parameters: 7794184
+ In Collection: EfficientNet
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 81.44
+ Top 5 Accuracy: 95.83
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/efficientnet/efficientnet-b1_3rdparty-ra-noisystudent_in1k_20221103-756bcbc0.pth
+ Config: configs/efficientnet/efficientnet-b1_8xb32_in1k.py
+ Converted From:
+ Weights: https://storage.googleapis.com/cloud-tpu-checkpoints/efficientnet/noisystudent/noisy_student_efficientnet-b1.tar.gz
+ Code: https://github.com/tensorflow/tpu/tree/master/models/official/efficientnet
+ - Name: efficientnet-b2_3rdparty_8xb32_in1k
+ Metadata:
+ FLOPs: 1066620392
+ Parameters: 9109994
+ In Collection: EfficientNet
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 79.64
+ Top 5 Accuracy: 94.80
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/efficientnet/efficientnet-b2_3rdparty_8xb32_in1k_20220119-ea374a30.pth
+ Config: configs/efficientnet/efficientnet-b2_8xb32_in1k.py
+ Converted From:
+ Weights: https://storage.googleapis.com/cloud-tpu-checkpoints/efficientnet/ckpts/efficientnet-b2.tar.gz
+ Code: https://github.com/tensorflow/tpu/tree/master/models/official/efficientnet
+ - Name: efficientnet-b2_3rdparty_8xb32-aa_in1k
+ Metadata:
+ FLOPs: 1066620392
+ Parameters: 9109994
+ In Collection: EfficientNet
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 80.21
+ Top 5 Accuracy: 94.96
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/efficientnet/efficientnet-b2_3rdparty_8xb32-aa_in1k_20220119-dd61e80b.pth
+ Config: configs/efficientnet/efficientnet-b2_8xb32_in1k.py
+ Converted From:
+ Weights: https://storage.googleapis.com/cloud-tpu-checkpoints/efficientnet/ckptsaug/efficientnet-b2.tar.gz
+ Code: https://github.com/tensorflow/tpu/tree/master/models/official/efficientnet
+ - Name: efficientnet-b2_3rdparty_8xb32-aa-advprop_in1k
+ Metadata:
+ FLOPs: 1066620392
+ Parameters: 9109994
+ In Collection: EfficientNet
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 80.45
+ Top 5 Accuracy: 95.07
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/efficientnet/efficientnet-b2_3rdparty_8xb32-aa-advprop_in1k_20220119-1655338a.pth
+ Config: configs/efficientnet/efficientnet-b2_8xb32-01norm_in1k.py
+ Converted From:
+ Weights: https://storage.googleapis.com/cloud-tpu-checkpoints/efficientnet/advprop/efficientnet-b2.tar.gz
+ Code: https://github.com/tensorflow/tpu/tree/master/models/official/efficientnet
+ - Name: efficientnet-b2_3rdparty-ra-noisystudent_in1k
+ Metadata:
+ FLOPs: 1066620392
+ Parameters: 9109994
+ In Collection: EfficientNet
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 82.47
+ Top 5 Accuracy: 96.23
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/efficientnet/efficientnet-b2_3rdparty-ra-noisystudent_in1k_20221103-301ed299.pth
+ Config: configs/efficientnet/efficientnet-b2_8xb32_in1k.py
+ Converted From:
+ Weights: https://storage.googleapis.com/cloud-tpu-checkpoints/efficientnet/noisystudent/noisy_student_efficientnet-b2.tar.gz
+ Code: https://github.com/tensorflow/tpu/tree/master/models/official/efficientnet
+ - Name: efficientnet-b3_3rdparty_8xb32_in1k
+ Metadata:
+ FLOPs: 1953798216
+ Parameters: 12233232
+ In Collection: EfficientNet
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 81.01
+ Top 5 Accuracy: 95.34
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/efficientnet/efficientnet-b3_3rdparty_8xb32_in1k_20220119-4b4d7487.pth
+ Config: configs/efficientnet/efficientnet-b3_8xb32_in1k.py
+ Converted From:
+ Weights: https://storage.googleapis.com/cloud-tpu-checkpoints/efficientnet/ckpts/efficientnet-b3.tar.gz
+ Code: https://github.com/tensorflow/tpu/tree/master/models/official/efficientnet
+ - Name: efficientnet-b3_3rdparty_8xb32-aa_in1k
+ Metadata:
+ FLOPs: 1953798216
+ Parameters: 12233232
+ In Collection: EfficientNet
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 81.58
+ Top 5 Accuracy: 95.67
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/efficientnet/efficientnet-b3_3rdparty_8xb32-aa_in1k_20220119-5b4887a0.pth
+ Config: configs/efficientnet/efficientnet-b3_8xb32_in1k.py
+ Converted From:
+ Weights: https://storage.googleapis.com/cloud-tpu-checkpoints/efficientnet/ckptsaug/efficientnet-b3.tar.gz
+ Code: https://github.com/tensorflow/tpu/tree/master/models/official/efficientnet
+ - Name: efficientnet-b3_3rdparty_8xb32-aa-advprop_in1k
+ Metadata:
+ FLOPs: 1953798216
+ Parameters: 12233232
+ In Collection: EfficientNet
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 81.81
+ Top 5 Accuracy: 95.69
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/efficientnet/efficientnet-b3_3rdparty_8xb32-aa-advprop_in1k_20220119-53b41118.pth
+ Config: configs/efficientnet/efficientnet-b3_8xb32-01norm_in1k.py
+ Converted From:
+ Weights: https://storage.googleapis.com/cloud-tpu-checkpoints/efficientnet/advprop/efficientnet-b3.tar.gz
+ Code: https://github.com/tensorflow/tpu/tree/master/models/official/efficientnet
+ - Name: efficientnet-b3_3rdparty-ra-noisystudent_in1k
+ Metadata:
+ FLOPs: 1953798216
+ Parameters: 12233232
+ In Collection: EfficientNet
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 84.02
+ Top 5 Accuracy: 96.89
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/efficientnet/efficientnet-b3_3rdparty-ra-noisystudent_in1k_20221103-a4ab5fd6.pth
+ Config: configs/efficientnet/efficientnet-b3_8xb32_in1k.py
+ Converted From:
+ Weights: https://storage.googleapis.com/cloud-tpu-checkpoints/efficientnet/noisystudent/noisy_student_efficientnet-b3.tar.gz
+ Code: https://github.com/tensorflow/tpu/tree/master/models/official/efficientnet
+ - Name: efficientnet-b4_3rdparty_8xb32_in1k
+ Metadata:
+ FLOPs: 4659080176
+ Parameters: 19341616
+ In Collection: EfficientNet
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 82.57
+ Top 5 Accuracy: 96.09
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/efficientnet/efficientnet-b4_3rdparty_8xb32_in1k_20220119-81fd4077.pth
+ Config: configs/efficientnet/efficientnet-b4_8xb32_in1k.py
+ Converted From:
+ Weights: https://storage.googleapis.com/cloud-tpu-checkpoints/efficientnet/ckpts/efficientnet-b4.tar.gz
+ Code: https://github.com/tensorflow/tpu/tree/master/models/official/efficientnet
+ - Name: efficientnet-b4_3rdparty_8xb32-aa_in1k
+ Metadata:
+ FLOPs: 4659080176
+ Parameters: 19341616
+ In Collection: EfficientNet
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 82.95
+ Top 5 Accuracy: 96.26
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/efficientnet/efficientnet-b4_3rdparty_8xb32-aa_in1k_20220119-45b8bd2b.pth
+ Config: configs/efficientnet/efficientnet-b4_8xb32_in1k.py
+ Converted From:
+ Weights: https://storage.googleapis.com/cloud-tpu-checkpoints/efficientnet/ckptsaug/efficientnet-b4.tar.gz
+ Code: https://github.com/tensorflow/tpu/tree/master/models/official/efficientnet
+ - Name: efficientnet-b4_3rdparty_8xb32-aa-advprop_in1k
+ Metadata:
+ FLOPs: 4659080176
+ Parameters: 19341616
+ In Collection: EfficientNet
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 83.25
+ Top 5 Accuracy: 96.44
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/efficientnet/efficientnet-b4_3rdparty_8xb32-aa-advprop_in1k_20220119-38c2238c.pth
+ Config: configs/efficientnet/efficientnet-b4_8xb32-01norm_in1k.py
+ Converted From:
+ Weights: https://storage.googleapis.com/cloud-tpu-checkpoints/efficientnet/advprop/efficientnet-b4.tar.gz
+ Code: https://github.com/tensorflow/tpu/tree/master/models/official/efficientnet
+ - Name: efficientnet-b4_3rdparty-ra-noisystudent_in1k
+ Metadata:
+ FLOPs: 4659080176
+ Parameters: 19341616
+ In Collection: EfficientNet
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 85.25
+ Top 5 Accuracy: 97.52
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/efficientnet/efficientnet-b4_3rdparty-ra-noisystudent_in1k_20221103-16ba8a2d.pth
+ Config: configs/efficientnet/efficientnet-b4_8xb32_in1k.py
+ Converted From:
+ Weights: https://storage.googleapis.com/cloud-tpu-checkpoints/efficientnet/noisystudent/noisy_student_efficientnet-b4.tar.gz
+ Code: https://github.com/tensorflow/tpu/tree/master/models/official/efficientnet
+ - Name: efficientnet-b5_3rdparty_8xb32_in1k
+ Metadata:
+ FLOPs: 10799472560
+ Parameters: 30389784
+ In Collection: EfficientNet
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 83.18
+ Top 5 Accuracy: 96.47
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/efficientnet/efficientnet-b5_3rdparty_8xb32_in1k_20220119-e9814430.pth
+ Config: configs/efficientnet/efficientnet-b5_8xb32_in1k.py
+ Converted From:
+ Weights: https://storage.googleapis.com/cloud-tpu-checkpoints/efficientnet/ckpts/efficientnet-b5.tar.gz
+ Code: https://github.com/tensorflow/tpu/tree/master/models/official/efficientnet
+ - Name: efficientnet-b5_3rdparty_8xb32-aa_in1k
+ Metadata:
+ FLOPs: 10799472560
+ Parameters: 30389784
+ In Collection: EfficientNet
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 83.82
+ Top 5 Accuracy: 96.76
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/efficientnet/efficientnet-b5_3rdparty_8xb32-aa_in1k_20220119-2cab8b78.pth
+ Config: configs/efficientnet/efficientnet-b5_8xb32_in1k.py
+ Converted From:
+ Weights: https://storage.googleapis.com/cloud-tpu-checkpoints/efficientnet/ckptsaug/efficientnet-b5.tar.gz
+ Code: https://github.com/tensorflow/tpu/tree/master/models/official/efficientnet
+ - Name: efficientnet-b5_3rdparty_8xb32-aa-advprop_in1k
+ Metadata:
+ FLOPs: 10799472560
+ Parameters: 30389784
+ In Collection: EfficientNet
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 84.21
+ Top 5 Accuracy: 96.98
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/efficientnet/efficientnet-b5_3rdparty_8xb32-aa-advprop_in1k_20220119-f57a895a.pth
+ Config: configs/efficientnet/efficientnet-b5_8xb32-01norm_in1k.py
+ Converted From:
+ Weights: https://storage.googleapis.com/cloud-tpu-checkpoints/efficientnet/advprop/efficientnet-b5.tar.gz
+ Code: https://github.com/tensorflow/tpu/tree/master/models/official/efficientnet
+ - Name: efficientnet-b5_3rdparty-ra-noisystudent_in1k
+ Metadata:
+ FLOPs: 10799472560
+ Parameters: 30389784
+ In Collection: EfficientNet
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 86.08
+ Top 5 Accuracy: 97.75
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/efficientnet/efficientnet-b5_3rdparty-ra-noisystudent_in1k_20221103-111a185f.pth
+ Config: configs/efficientnet/efficientnet-b5_8xb32_in1k.py
+ Converted From:
+ Weights: https://storage.googleapis.com/cloud-tpu-checkpoints/efficientnet/noisystudent/noisy_student_efficientnet-b5.tar.gz
+ Code: https://github.com/tensorflow/tpu/tree/master/models/official/efficientnet
+ - Name: efficientnet-b6_3rdparty_8xb32-aa_in1k
+ Metadata:
+ FLOPs: 19971777560
+ Parameters: 43040704
+ In Collection: EfficientNet
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 84.05
+ Top 5 Accuracy: 96.82
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/efficientnet/efficientnet-b6_3rdparty_8xb32-aa_in1k_20220119-45b03310.pth
+ Config: configs/efficientnet/efficientnet-b6_8xb32_in1k.py
+ Converted From:
+ Weights: https://storage.googleapis.com/cloud-tpu-checkpoints/efficientnet/ckptsaug/efficientnet-b6.tar.gz
+ Code: https://github.com/tensorflow/tpu/tree/master/models/official/efficientnet
+ - Name: efficientnet-b6_3rdparty_8xb32-aa-advprop_in1k
+ Metadata:
+ FLOPs: 19971777560
+ Parameters: 43040704
+ In Collection: EfficientNet
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 84.74
+ Top 5 Accuracy: 97.14
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/efficientnet/efficientnet-b6_3rdparty_8xb32-aa-advprop_in1k_20220119-bfe3485e.pth
+ Config: configs/efficientnet/efficientnet-b6_8xb32-01norm_in1k.py
+ Converted From:
+ Weights: https://storage.googleapis.com/cloud-tpu-checkpoints/efficientnet/advprop/efficientnet-b6.tar.gz
+ Code: https://github.com/tensorflow/tpu/tree/master/models/official/efficientnet
+ - Name: efficientnet-b6_3rdparty-ra-noisystudent_in1k
+ Metadata:
+ FLOPs: 19971777560
+ Parameters: 43040704
+ In Collection: EfficientNet
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 86.47
+ Top 5 Accuracy: 97.87
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/efficientnet/efficientnet-b6_3rdparty-ra-noisystudent_in1k_20221103-7de7d2cc.pth
+ Config: configs/efficientnet/efficientnet-b6_8xb32_in1k.py
+ Converted From:
+ Weights: https://storage.googleapis.com/cloud-tpu-checkpoints/efficientnet/noisystudent/noisy_student_efficientnet-b6.tar.gz
+ Code: https://github.com/tensorflow/tpu/tree/master/models/official/efficientnet
+ - Name: efficientnet-b7_3rdparty_8xb32-aa_in1k
+ Metadata:
+ FLOPs: 39316473392
+ Parameters: 66347960
+ In Collection: EfficientNet
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 84.38
+ Top 5 Accuracy: 96.88
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/efficientnet/efficientnet-b7_3rdparty_8xb32-aa_in1k_20220119-bf03951c.pth
+ Config: configs/efficientnet/efficientnet-b7_8xb32_in1k.py
+ Converted From:
+ Weights: https://storage.googleapis.com/cloud-tpu-checkpoints/efficientnet/ckptsaug/efficientnet-b7.tar.gz
+ Code: https://github.com/tensorflow/tpu/tree/master/models/official/efficientnet
+ - Name: efficientnet-b7_3rdparty_8xb32-aa-advprop_in1k
+ Metadata:
+ FLOPs: 39316473392
+ Parameters: 66347960
+ In Collection: EfficientNet
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 85.14
+ Top 5 Accuracy: 97.23
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/efficientnet/efficientnet-b7_3rdparty_8xb32-aa-advprop_in1k_20220119-c6dbff10.pth
+ Config: configs/efficientnet/efficientnet-b7_8xb32-01norm_in1k.py
+ Converted From:
+ Weights: https://storage.googleapis.com/cloud-tpu-checkpoints/efficientnet/advprop/efficientnet-b7.tar.gz
+ Code: https://github.com/tensorflow/tpu/tree/master/models/official/efficientnet
+ - Name: efficientnet-b7_3rdparty-ra-noisystudent_in1k
+ Metadata:
+ FLOPs: 39316473392
+ Parameters: 66347960
+ In Collection: EfficientNet
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 86.83
+ Top 5 Accuracy: 98.08
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/efficientnet/efficientnet-b7_3rdparty-ra-noisystudent_in1k_20221103-a82894bc.pth
+ Config: configs/efficientnet/efficientnet-b7_8xb32_in1k.py
+ Converted From:
+ Weights: https://storage.googleapis.com/cloud-tpu-checkpoints/efficientnet/noisystudent/noisy_student_efficientnet-b7.tar.gz
+ Code: https://github.com/tensorflow/tpu/tree/master/models/official/efficientnet
+ - Name: efficientnet-b8_3rdparty_8xb32-aa-advprop_in1k
+ Metadata:
+ FLOPs: 64999827816
+ Parameters: 87413142
+ In Collection: EfficientNet
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 85.38
+ Top 5 Accuracy: 97.28
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/efficientnet/efficientnet-b8_3rdparty_8xb32-aa-advprop_in1k_20220119-297ce1b7.pth
+ Config: configs/efficientnet/efficientnet-b8_8xb32-01norm_in1k.py
+ Converted From:
+ Weights: https://storage.googleapis.com/cloud-tpu-checkpoints/efficientnet/advprop/efficientnet-b8.tar.gz
+ Code: https://github.com/tensorflow/tpu/tree/master/models/official/efficientnet
+ - Name: efficientnet-l2_3rdparty-ra-noisystudent_in1k-800px
+ Metadata:
+      FLOPs: 484984099280
+ Parameters: 480309308
+ In Collection: EfficientNet
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 88.33
+ Top 5 Accuracy: 98.65
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/efficientnet/efficientnet-l2_3rdparty-ra-noisystudent_in1k_20221103-be73be13.pth
+ Config: configs/efficientnet/efficientnet-l2_8xb8_in1k-800px.py
+ Converted From:
+ Weights: https://storage.googleapis.com/cloud-tpu-checkpoints/efficientnet/noisystudent/noisy_student_efficientnet-l2.tar.gz
+ Code: https://github.com/tensorflow/tpu/tree/master/models/official/efficientnet
+ - Name: efficientnet-l2_3rdparty-ra-noisystudent_in1k-475px
+ Metadata:
+      FLOPs: 174203533416
+ Parameters: 480309308
+ In Collection: EfficientNet
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 88.18
+ Top 5 Accuracy: 98.55
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/efficientnet/efficientnet-l2_3rdparty-ra-noisystudent_in1k-475px_20221103-5a0d8058.pth
+ Config: configs/efficientnet/efficientnet-l2_8xb32_in1k-475px.py
+ Converted From:
+ Weights: https://storage.googleapis.com/cloud-tpu-checkpoints/efficientnet/noisystudent/noisy_student_efficientnet-l2_475.tar.gz
+ Code: https://github.com/tensorflow/tpu/tree/master/models/official/efficientnet
diff --git a/configs/efficientnet_v2/README.md b/configs/efficientnet_v2/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..965421823e7fe3e6cf8504d717864bf8a499ab2e
--- /dev/null
+++ b/configs/efficientnet_v2/README.md
@@ -0,0 +1,98 @@
+# EfficientNetV2
+
+> [EfficientNetV2: Smaller Models and Faster Training](https://arxiv.org/abs/2104.00298)
+
+
+
+## Abstract
+
+This paper introduces EfficientNetV2, a new family of convolutional networks that have faster training speed and better parameter efficiency than previous models. To develop this family of models, we use a combination of training-aware neural architecture search and scaling, to jointly optimize training speed and parameter efficiency. The models were searched from the search space enriched with new ops such as Fused-MBConv. Our experiments show that EfficientNetV2 models train much faster than state-of-the-art models while being up to 6.8x smaller. Our training can be further sped up by progressively increasing the image size during training, but it often causes a drop in accuracy. To compensate for this accuracy drop, we propose to adaptively adjust regularization (e.g., dropout and data augmentation) as well, such that we can achieve both fast training and good accuracy. With progressive learning, our EfficientNetV2 significantly outperforms previous models on ImageNet and CIFAR/Cars/Flowers datasets. By pretraining on the same ImageNet21k, our EfficientNetV2 achieves 87.3% top-1 accuracy on ImageNet ILSVRC2012, outperforming the recent ViT by 2.0% accuracy while training 5x-11x faster using the same computing resources. Code will be available at https://github.com/google/automl/tree/master/efficientnetv2.
+
+
+

+
+
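+The progressive learning recipe described in the abstract amounts to a staged schedule: early stages train on small images with weak regularization, and both the image size and the regularization strength (dropout, RandAugment magnitude) are increased together in later stages. Below is a minimal, framework-agnostic sketch of such a schedule; the number of stages and the value ranges are illustrative assumptions, not the exact settings used in the paper or in these configs.
+
+```python
+def progressive_schedule(stage, num_stages=4, image_size=(128, 300),
+                         dropout=(0.1, 0.3), randaug_magnitude=(5, 15)):
+    """Linearly interpolate image size and regularization across stages."""
+    t = stage / max(num_stages - 1, 1)  # training progress in [0, 1]
+
+    def lerp(lo, hi):
+        return lo + (hi - lo) * t
+
+    return dict(
+        image_size=int(round(lerp(*image_size))),
+        dropout=round(lerp(*dropout), 3),
+        randaug_magnitude=round(lerp(*randaug_magnitude), 1),
+    )
+
+
+# Stage 0: 128px images with light regularization; stage 3: 300px with the strongest settings.
+for stage in range(4):
+    print(stage, progressive_schedule(stage))
+```
+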
+## How to use it?
+
+
+
+**Predict image**
+
+```python
+from mmpretrain import inference_model
+
+predict = inference_model('efficientnetv2-b0_3rdparty_in1k', 'demo/bird.JPEG')
+print(predict['pred_class'])
+print(predict['pred_score'])
+```
+
+**Use the model**
+
+```python
+import torch
+from mmpretrain import get_model
+
+model = get_model('efficientnetv2-b0_3rdparty_in1k', pretrained=True)
+inputs = torch.rand(1, 3, 224, 224)
+out = model(inputs)
+print(type(out))
+# To extract features.
+feats = model.extract_feat(inputs)
+print(type(feats))
+```
+
+**Test Command**
+
+Prepare your dataset according to the [docs](https://mmpretrain.readthedocs.io/en/latest/user_guides/dataset_prepare.html#prepare-dataset).
+
+Test:
+
+```shell
+python tools/test.py configs/efficientnet_v2/efficientnetv2-b0_8xb32_in1k.py https://download.openmmlab.com/mmclassification/v0/efficientnetv2/efficientnetv2-b0_3rdparty_in1k_20221221-9ef6e736.pth
+```
+
+
+
+## Models and results
+
+### Pretrained models
+
+| Model | Params (M) | Flops (G) | Config | Download |
+| :----------------------------------- | :--------: | :-------: | :----------------------------------------: | :-----------------------------------------------------------------------------------------------------: |
+| `efficientnetv2-s_3rdparty_in21k`\* | 48.16 | 3.31 | [config](efficientnetv2-s_8xb32_in21k.py) | [model](https://download.openmmlab.com/mmclassification/v0/efficientnetv2/efficientnetv2-s_3rdparty_in21k_20221220-c0572b56.pth) |
+| `efficientnetv2-m_3rdparty_in21k`\* | 80.84 | 5.86 | [config](efficientnetv2-m_8xb32_in21k.py) | [model](https://download.openmmlab.com/mmclassification/v0/efficientnetv2/efficientnetv2-m_3rdparty_in21k_20221220-073e944c.pth) |
+| `efficientnetv2-l_3rdparty_in21k`\* | 145.22 | 13.11 | [config](efficientnetv2-l_8xb32_in21k.py) | [model](https://download.openmmlab.com/mmclassification/v0/efficientnetv2/efficientnetv2-l_3rdparty_in21k_20221220-f28f91e1.pth) |
+| `efficientnetv2-xl_3rdparty_in21k`\* | 234.82 | 18.86 | [config](efficientnetv2-xl_8xb32_in21k.py) | [model](https://download.openmmlab.com/mmclassification/v0/efficientnetv2/efficientnetv2-xl_3rdparty_in21k_20221220-b2c9329c.pth) |
+
+*Models with \* are converted from [timm](https://github.com/rwightman/pytorch-image-models/blob/main/timm/models/efficientnet.py). The config files of these models are only for inference. We haven't reproduced the training results.*
+
+### Image Classification on ImageNet-1k
+
+| Model | Pretrain | Params (M) | Flops (G) | Top-1 (%) | Top-5 (%) | Config | Download |
+| :-------------------------------------------- | :----------: | :--------: | :-------: | :-------: | :-------: | :---------------------------------------------: | :---------------------------------------------------------: |
+| `efficientnetv2-b0_3rdparty_in1k`\* | From scratch | 7.14 | 0.92 | 78.52 | 94.44 | [config](efficientnetv2-b0_8xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/efficientnetv2/efficientnetv2-b0_3rdparty_in1k_20221221-9ef6e736.pth) |
+| `efficientnetv2-b1_3rdparty_in1k`\* | From scratch | 8.14 | 1.44 | 79.80 | 94.89 | [config](efficientnetv2-b1_8xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/efficientnetv2/efficientnetv2-b1_3rdparty_in1k_20221221-6955d9ce.pth) |
+| `efficientnetv2-b2_3rdparty_in1k`\* | From scratch | 10.10 | 1.99 | 80.63 | 95.30 | [config](efficientnetv2-b2_8xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/efficientnetv2/efficientnetv2-b2_3rdparty_in1k_20221221-74f7d493.pth) |
+| `efficientnetv2-b3_3rdparty_in1k`\* | From scratch | 14.36 | 3.50 | 82.03 | 95.88 | [config](efficientnetv2-b3_8xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/efficientnetv2/efficientnetv2-b3_3rdparty_in1k_20221221-b6f07a36.pth) |
+| `efficientnetv2-s_3rdparty_in1k`\* | From scratch | 21.46 | 9.72 | 83.82 | 96.67 | [config](efficientnetv2-s_8xb32_in1k-384px.py) | [model](https://download.openmmlab.com/mmclassification/v0/efficientnetv2/efficientnetv2-s_3rdparty_in1k_20221220-f0eaff9d.pth) |
+| `efficientnetv2-m_3rdparty_in1k`\* | From scratch | 54.14 | 26.88 | 85.01 | 97.26 | [config](efficientnetv2-m_8xb32_in1k-480px.py) | [model](https://download.openmmlab.com/mmclassification/v0/efficientnetv2/efficientnetv2-m_3rdparty_in1k_20221220-9dc0c729.pth) |
+| `efficientnetv2-l_3rdparty_in1k`\* | From scratch | 118.52 | 60.14 | 85.43 | 97.31 | [config](efficientnetv2-l_8xb32_in1k-480px.py) | [model](https://download.openmmlab.com/mmclassification/v0/efficientnetv2/efficientnetv2-l_3rdparty_in1k_20221220-5c3bac0f.pth) |
+| `efficientnetv2-s_in21k-pre_3rdparty_in1k`\* | ImageNet-21k | 21.46 | 9.72 | 84.29 | 97.26 | [config](efficientnetv2-s_8xb32_in1k-384px.py) | [model](https://download.openmmlab.com/mmclassification/v0/efficientnetv2/efficientnetv2-s_in21k-pre-3rdparty_in1k_20221220-7a7c8475.pth) |
+| `efficientnetv2-m_in21k-pre_3rdparty_in1k`\* | ImageNet-21k | 54.14 | 26.88 | 85.47 | 97.76 | [config](efficientnetv2-m_8xb32_in1k-480px.py) | [model](https://download.openmmlab.com/mmclassification/v0/efficientnetv2/efficientnetv2-m_in21k-pre-3rdparty_in1k_20221220-a1013a04.pth) |
+| `efficientnetv2-l_in21k-pre_3rdparty_in1k`\* | ImageNet-21k | 118.52 | 60.14 | 86.31 | 97.99 | [config](efficientnetv2-l_8xb32_in1k-480px.py) | [model](https://download.openmmlab.com/mmclassification/v0/efficientnetv2/efficientnetv2-l_in21k-pre-3rdparty_in1k_20221220-63df0efd.pth) |
+| `efficientnetv2-xl_in21k-pre_3rdparty_in1k`\* | ImageNet-21k | 208.12 | 98.34 | 86.39 | 97.83 | [config](efficientnetv2-xl_8xb32_in1k-512px.py) | [model](https://download.openmmlab.com/mmclassification/v0/efficientnetv2/efficientnetv2-xl_in21k-pre-3rdparty_in1k_20221220-583ac18b.pth) |
+
+*Models with \* are converted from [timm](https://github.com/rwightman/pytorch-image-models/blob/main/timm/models/efficientnet.py). The config files of these models are only for inference. We haven't reproduced the training results.*
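+
+The ImageNet-21k checkpoints in the first table are meant as pre-training weights rather than 1k classifiers. Below is a minimal sketch of loading one of them for feature extraction; it assumes the `efficientnetv2-s_3rdparty_in21k` entry is registered under exactly the name listed above.
+
+```python
+import torch
+from mmpretrain import get_model
+
+# Load the ImageNet-21k pre-trained EfficientNetV2-S from the table above.
+model = get_model('efficientnetv2-s_3rdparty_in21k', pretrained=True)
+model.eval()
+
+with torch.no_grad():
+    feats = model.extract_feat(torch.rand(1, 3, 224, 224))
+print([f.shape for f in feats])  # backbone features, e.g. a (1, 1280) vector
+```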
+
+## Citation
+
+```bibtex
+@inproceedings{tan2021efficientnetv2,
+ title={Efficientnetv2: Smaller models and faster training},
+ author={Tan, Mingxing and Le, Quoc},
+ booktitle={International Conference on Machine Learning},
+ pages={10096--10106},
+ year={2021},
+ organization={PMLR}
+}
+```
diff --git a/configs/efficientnet_v2/efficientnetv2-b0_8xb32_in1k.py b/configs/efficientnet_v2/efficientnetv2-b0_8xb32_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..4dc23d4904ef87f3ca581dc022a65f8d9c925038
--- /dev/null
+++ b/configs/efficientnet_v2/efficientnetv2-b0_8xb32_in1k.py
@@ -0,0 +1,58 @@
+_base_ = [
+ '../_base_/models/efficientnet_v2/efficientnetv2_b0.py',
+ '../_base_/datasets/imagenet_bs32.py',
+ '../_base_/schedules/imagenet_bs256.py',
+ '../_base_/default_runtime.py',
+]
+
+# dataset settings
+dataset_type = 'ImageNet'
+data_preprocessor = dict(
+ num_classes=1000,
+ # RGB format normalization parameters
+ mean=[123.675, 116.28, 103.53],
+ std=[58.395, 57.12, 57.375],
+ # convert image from BGR to RGB
+ to_rgb=True,
+)
+
+bgr_mean = data_preprocessor['mean'][::-1]
+bgr_std = data_preprocessor['std'][::-1]
+
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='RandomResizedCrop',
+ scale=192,
+ backend='pillow',
+ interpolation='bicubic'),
+ dict(type='RandomFlip', prob=0.5, direction='horizontal'),
+ dict(
+ type='RandAugment',
+ policies='timm_increasing',
+ num_policies=2,
+ total_level=10,
+ magnitude_level=9,
+ magnitude_std=0.5,
+ hparams=dict(
+ pad_val=[round(x) for x in bgr_mean], interpolation='bicubic')),
+ dict(
+ type='RandomErasing',
+ erase_prob=0.25,
+ mode='rand',
+ min_area_ratio=0.02,
+ max_area_ratio=1 / 3,
+ fill_color=bgr_mean,
+ fill_std=bgr_std),
+ dict(type='PackInputs'),
+]
+
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='EfficientNetCenterCrop', crop_size=224, crop_padding=0),
+ dict(type='PackInputs'),
+]
+
+train_dataloader = dict(dataset=dict(pipeline=train_pipeline))
+val_dataloader = dict(dataset=dict(pipeline=test_pipeline))
+test_dataloader = dict(dataset=dict(pipeline=test_pipeline))
diff --git a/configs/efficientnet_v2/efficientnetv2-b1_8xb32_in1k.py b/configs/efficientnet_v2/efficientnetv2-b1_8xb32_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..fa187ff1503531732b10e2b178751361e4a4de2d
--- /dev/null
+++ b/configs/efficientnet_v2/efficientnetv2-b1_8xb32_in1k.py
@@ -0,0 +1,21 @@
+_base_ = ['./efficientnetv2-b0_8xb32_in1k.py']
+
+# model setting
+model = dict(backbone=dict(arch='b1'), head=dict(in_channels=1280, ))
+
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='EfficientNetRandomCrop', scale=192),
+ dict(type='RandomFlip', prob=0.5, direction='horizontal'),
+ dict(type='PackInputs'),
+]
+
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='EfficientNetCenterCrop', crop_size=240, crop_padding=0),
+ dict(type='PackInputs'),
+]
+
+train_dataloader = dict(dataset=dict(pipeline=train_pipeline))
+val_dataloader = dict(dataset=dict(pipeline=test_pipeline))
+test_dataloader = dict(dataset=dict(pipeline=test_pipeline))
diff --git a/configs/efficientnet_v2/efficientnetv2-b2_8xb32_in1k.py b/configs/efficientnet_v2/efficientnetv2-b2_8xb32_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..3ff5530d1dbac739295c6fbc1f61fa6b36d8aa65
--- /dev/null
+++ b/configs/efficientnet_v2/efficientnetv2-b2_8xb32_in1k.py
@@ -0,0 +1,21 @@
+_base_ = ['./efficientnetv2-b0_8xb32_in1k.py']
+
+# model setting
+model = dict(backbone=dict(arch='b2'), head=dict(in_channels=1408, ))
+
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='EfficientNetRandomCrop', scale=208),
+ dict(type='RandomFlip', prob=0.5, direction='horizontal'),
+ dict(type='PackInputs'),
+]
+
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='EfficientNetCenterCrop', crop_size=260, crop_padding=0),
+ dict(type='PackInputs'),
+]
+
+train_dataloader = dict(dataset=dict(pipeline=train_pipeline))
+val_dataloader = dict(dataset=dict(pipeline=test_pipeline))
+test_dataloader = dict(dataset=dict(pipeline=test_pipeline))
diff --git a/configs/efficientnet_v2/efficientnetv2-b3_8xb32_in1k.py b/configs/efficientnet_v2/efficientnetv2-b3_8xb32_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..84fb29a55400a44af414b909c49806381f9564b9
--- /dev/null
+++ b/configs/efficientnet_v2/efficientnetv2-b3_8xb32_in1k.py
@@ -0,0 +1,21 @@
+_base_ = ['./efficientnetv2-b0_8xb32_in1k.py']
+
+# model setting
+model = dict(backbone=dict(arch='b3'), head=dict(in_channels=1536, ))
+
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='EfficientNetRandomCrop', scale=240),
+ dict(type='RandomFlip', prob=0.5, direction='horizontal'),
+ dict(type='PackInputs'),
+]
+
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='EfficientNetCenterCrop', crop_size=300, crop_padding=0),
+ dict(type='PackInputs'),
+]
+
+train_dataloader = dict(dataset=dict(pipeline=train_pipeline))
+val_dataloader = dict(dataset=dict(pipeline=test_pipeline))
+test_dataloader = dict(dataset=dict(pipeline=test_pipeline))
diff --git a/configs/efficientnet_v2/efficientnetv2-l_8xb32_in1k-480px.py b/configs/efficientnet_v2/efficientnetv2-l_8xb32_in1k-480px.py
new file mode 100644
index 0000000000000000000000000000000000000000..c3606cf07086f6a8f0580183e6f94d9e1950dae3
--- /dev/null
+++ b/configs/efficientnet_v2/efficientnetv2-l_8xb32_in1k-480px.py
@@ -0,0 +1,23 @@
+_base_ = [
+ 'efficientnetv2-s_8xb32_in1k-384px.py',
+]
+
+# model setting
+model = dict(backbone=dict(arch='l'), )
+
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='EfficientNetRandomCrop', scale=384, crop_padding=0),
+ dict(type='RandomFlip', prob=0.5, direction='horizontal'),
+ dict(type='PackInputs'),
+]
+
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='EfficientNetCenterCrop', crop_size=480, crop_padding=0),
+ dict(type='PackInputs'),
+]
+
+train_dataloader = dict(dataset=dict(pipeline=train_pipeline))
+val_dataloader = dict(dataset=dict(pipeline=test_pipeline))
+test_dataloader = dict(dataset=dict(pipeline=test_pipeline))
diff --git a/configs/efficientnet_v2/efficientnetv2-l_8xb32_in21k.py b/configs/efficientnet_v2/efficientnetv2-l_8xb32_in21k.py
new file mode 100644
index 0000000000000000000000000000000000000000..179c72075f6f5caa4fc551fee0e3462db6dcba18
--- /dev/null
+++ b/configs/efficientnet_v2/efficientnetv2-l_8xb32_in21k.py
@@ -0,0 +1,4 @@
+_base_ = ['./efficientnetv2-s_8xb32_in21k.py']
+
+# model setting
+model = dict(backbone=dict(arch='l'), )
diff --git a/configs/efficientnet_v2/efficientnetv2-m_8xb32_in1k-480px.py b/configs/efficientnet_v2/efficientnetv2-m_8xb32_in1k-480px.py
new file mode 100644
index 0000000000000000000000000000000000000000..c7bdd9be3b8e45ccb512f86049df482306ad91d9
--- /dev/null
+++ b/configs/efficientnet_v2/efficientnetv2-m_8xb32_in1k-480px.py
@@ -0,0 +1,23 @@
+_base_ = [
+ 'efficientnetv2-s_8xb32_in1k-384px.py',
+]
+
+# model setting
+model = dict(backbone=dict(arch='m'), )
+
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='EfficientNetRandomCrop', scale=384, crop_padding=0),
+ dict(type='RandomFlip', prob=0.5, direction='horizontal'),
+ dict(type='PackInputs'),
+]
+
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='EfficientNetCenterCrop', crop_size=480, crop_padding=0),
+ dict(type='PackInputs'),
+]
+
+train_dataloader = dict(dataset=dict(pipeline=train_pipeline))
+val_dataloader = dict(dataset=dict(pipeline=test_pipeline))
+test_dataloader = dict(dataset=dict(pipeline=test_pipeline))
diff --git a/configs/efficientnet_v2/efficientnetv2-m_8xb32_in21k.py b/configs/efficientnet_v2/efficientnetv2-m_8xb32_in21k.py
new file mode 100644
index 0000000000000000000000000000000000000000..f04d616376aa523526425c595904e64db0214ecc
--- /dev/null
+++ b/configs/efficientnet_v2/efficientnetv2-m_8xb32_in21k.py
@@ -0,0 +1,4 @@
+_base_ = ['./efficientnetv2-s_8xb32_in21k.py']
+
+# model setting
+model = dict(backbone=dict(arch='m'), )
diff --git a/configs/efficientnet_v2/efficientnetv2-s_8xb32_in1k-384px.py b/configs/efficientnet_v2/efficientnetv2-s_8xb32_in1k-384px.py
new file mode 100644
index 0000000000000000000000000000000000000000..2bdee636a20bf50cff4126cd50087724b7a9072f
--- /dev/null
+++ b/configs/efficientnet_v2/efficientnetv2-s_8xb32_in1k-384px.py
@@ -0,0 +1,34 @@
+_base_ = [
+ '../_base_/models/efficientnet_v2/efficientnetv2_s.py',
+ '../_base_/datasets/imagenet_bs32.py',
+ '../_base_/schedules/imagenet_bs256.py',
+ '../_base_/default_runtime.py',
+]
+
+# dataset settings
+dataset_type = 'ImageNet'
+data_preprocessor = dict(
+ num_classes=1000,
+ # RGB format normalization parameters
+ mean=[127.5, 127.5, 127.5],
+ std=[127.5, 127.5, 127.5],
+ # convert image from BGR to RGB
+ to_rgb=True,
+)
+
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='EfficientNetRandomCrop', scale=300, crop_padding=0),
+ dict(type='RandomFlip', prob=0.5, direction='horizontal'),
+ dict(type='PackInputs'),
+]
+
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='EfficientNetCenterCrop', crop_size=384, crop_padding=0),
+ dict(type='PackInputs'),
+]
+
+train_dataloader = dict(dataset=dict(pipeline=train_pipeline))
+val_dataloader = dict(dataset=dict(pipeline=test_pipeline))
+test_dataloader = dict(dataset=dict(pipeline=test_pipeline))
diff --git a/configs/efficientnet_v2/efficientnetv2-s_8xb32_in21k.py b/configs/efficientnet_v2/efficientnetv2-s_8xb32_in21k.py
new file mode 100644
index 0000000000000000000000000000000000000000..54f8a5af4eb92f8de1d7e5f488a8b222afda9239
--- /dev/null
+++ b/configs/efficientnet_v2/efficientnetv2-s_8xb32_in21k.py
@@ -0,0 +1,43 @@
+_base_ = [
+ '../_base_/models/efficientnet_v2/efficientnetv2_s.py',
+ '../_base_/datasets/imagenet_bs32.py',
+ '../_base_/schedules/imagenet_bs256.py',
+ '../_base_/default_runtime.py',
+]
+
+# model setting
+model = dict(head=dict(num_classes=21843))
+
+# dataset settings
+dataset_type = 'ImageNet21k'
+data_preprocessor = dict(
+ num_classes=21843,
+ # RGB format normalization parameters
+ mean=[127.5, 127.5, 127.5],
+ std=[127.5, 127.5, 127.5],
+ # convert image from BGR to RGB
+ to_rgb=True,
+)
+
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='EfficientNetRandomCrop', scale=224),
+ dict(type='RandomFlip', prob=0.5, direction='horizontal'),
+ dict(type='PackInputs'),
+]
+
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='EfficientNetCenterCrop', crop_size=224, crop_padding=0),
+ dict(type='PackInputs'),
+]
+
+train_dataloader = dict(dataset=dict(pipeline=train_pipeline))
+val_dataloader = dict(dataset=dict(pipeline=test_pipeline))
+test_dataloader = dict(dataset=dict(pipeline=test_pipeline))
+
+# schedule setting
+optim_wrapper = dict(
+ optimizer=dict(lr=4e-3),
+ clip_grad=dict(max_norm=5.0),
+)
diff --git a/configs/efficientnet_v2/efficientnetv2-xl_8xb32_in1k-512px.py b/configs/efficientnet_v2/efficientnetv2-xl_8xb32_in1k-512px.py
new file mode 100644
index 0000000000000000000000000000000000000000..18f56ff063b3dd1eee15f81718cd88cd83eeb9df
--- /dev/null
+++ b/configs/efficientnet_v2/efficientnetv2-xl_8xb32_in1k-512px.py
@@ -0,0 +1,23 @@
+_base_ = [
+ 'efficientnetv2-s_8xb32_in1k-384px.py',
+]
+
+# model setting
+model = dict(backbone=dict(arch='xl'), )
+
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='EfficientNetRandomCrop', scale=384, crop_padding=0),
+ dict(type='RandomFlip', prob=0.5, direction='horizontal'),
+ dict(type='PackInputs'),
+]
+
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='EfficientNetCenterCrop', crop_size=512, crop_padding=0),
+ dict(type='PackInputs'),
+]
+
+train_dataloader = dict(dataset=dict(pipeline=train_pipeline))
+val_dataloader = dict(dataset=dict(pipeline=test_pipeline))
+test_dataloader = dict(dataset=dict(pipeline=test_pipeline))
diff --git a/configs/efficientnet_v2/efficientnetv2-xl_8xb32_in21k.py b/configs/efficientnet_v2/efficientnetv2-xl_8xb32_in21k.py
new file mode 100644
index 0000000000000000000000000000000000000000..e2ee84cb32f7b83bf6d950a92088e983063ce049
--- /dev/null
+++ b/configs/efficientnet_v2/efficientnetv2-xl_8xb32_in21k.py
@@ -0,0 +1,4 @@
+_base_ = ['./efficientnetv2-s_8xb32_in21k.py']
+
+# model setting
+model = dict(backbone=dict(arch='xl'), )
diff --git a/configs/efficientnet_v2/metafile.yml b/configs/efficientnet_v2/metafile.yml
new file mode 100644
index 0000000000000000000000000000000000000000..6c927dce99ad0bf9c6e5555c4e9496e2613960d3
--- /dev/null
+++ b/configs/efficientnet_v2/metafile.yml
@@ -0,0 +1,255 @@
+Collections:
+ - Name: EfficientNetV2
+ Metadata:
+ Training Data: ImageNet-1k
+ Architecture:
+ - 1x1 Convolution
+ - Average Pooling
+ - Convolution
+ - Dense Connections
+ - Dropout
+ - Inverted Residual Block
+ - RMSProp
+ - Squeeze-and-Excitation Block
+ - Swish
+ Paper:
+ URL: https://arxiv.org/abs/2104.00298
+ Title: "EfficientNetV2: Smaller Models and Faster Training"
+ README: configs/efficientnet_v2/README.md
+ Code:
+      URL: https://github.com/open-mmlab/mmpretrain/blob/main/mmpretrain/models/backbones/efficientnet_v2.py
+ Version: v1.0.0rc4
+
+Models:
+ - Name: efficientnetv2-b0_3rdparty_in1k
+ Metadata:
+ FLOPs: 919843360
+ Parameters: 7139704
+ In Collection: EfficientNetV2
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 78.52
+ Top 5 Accuracy: 94.44
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/efficientnetv2/efficientnetv2-b0_3rdparty_in1k_20221221-9ef6e736.pth
+ Config: configs/efficientnet_v2/efficientnetv2-b0_8xb32_in1k.py
+ Converted From:
+ Weights: https://github.com/rwightman/pytorch-image-models/releases/download/v0.1-effv2-weights/tf_efficientnetv2_b0-c7cc451f.pth
+ Code: https://github.com/rwightman/pytorch-image-models/blob/main/timm/models/efficientnet.py
+ - Name: efficientnetv2-b1_3rdparty_in1k
+ Metadata:
+ FLOPs: 1438287552
+ Parameters: 8141052
+ In Collection: EfficientNetV2
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 79.80
+ Top 5 Accuracy: 94.89
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/efficientnetv2/efficientnetv2-b1_3rdparty_in1k_20221221-6955d9ce.pth
+ Config: configs/efficientnet_v2/efficientnetv2-b1_8xb32_in1k.py
+ Converted From:
+ Weights: https://github.com/rwightman/pytorch-image-models/releases/download/v0.1-effv2-weights/tf_efficientnetv2_b1-be6e41b0.pth
+ Code: https://github.com/rwightman/pytorch-image-models/blob/main/timm/models/efficientnet.py
+ - Name: efficientnetv2-b2_3rdparty_in1k
+ Metadata:
+ FLOPs: 1986433080
+ Parameters: 10096086
+ In Collection: EfficientNetV2
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 80.63
+ Top 5 Accuracy: 95.30
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/efficientnetv2/efficientnetv2-b2_3rdparty_in1k_20221221-74f7d493.pth
+ Config: configs/efficientnet_v2/efficientnetv2-b2_8xb32_in1k.py
+ Converted From:
+ Weights: https://github.com/rwightman/pytorch-image-models/releases/download/v0.1-effv2-weights/tf_efficientnetv2_b2-847de54e.pth
+ Code: https://github.com/rwightman/pytorch-image-models/blob/main/timm/models/efficientnet.py
+ - Name: efficientnetv2-b3_3rdparty_in1k
+ Metadata:
+ FLOPs: 3498068400
+ Parameters: 14358406
+ In Collection: EfficientNetV2
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 82.03
+ Top 5 Accuracy: 95.88
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/efficientnetv2/efficientnetv2-b3_3rdparty_in1k_20221221-b6f07a36.pth
+ Config: configs/efficientnet_v2/efficientnetv2-b3_8xb32_in1k.py
+ Converted From:
+ Weights: https://github.com/rwightman/pytorch-image-models/releases/download/v0.1-effv2-weights/tf_efficientnetv2_b3-57773f13.pth
+ Code: https://github.com/rwightman/pytorch-image-models/blob/main/timm/models/efficientnet.py
+ - Name: efficientnetv2-s_3rdparty_in1k
+ Metadata:
+ FLOPs: 9719420928
+ Parameters: 21458488
+ In Collection: EfficientNetV2
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 83.82
+ Top 5 Accuracy: 96.67
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/efficientnetv2/efficientnetv2-s_3rdparty_in1k_20221220-f0eaff9d.pth
+ Config: configs/efficientnet_v2/efficientnetv2-s_8xb32_in1k-384px.py
+ Converted From:
+ Weights: https://github.com/rwightman/pytorch-image-models/releases/download/v0.1-effv2-weights/tf_efficientnetv2_s-eb54923e.pth
+ Code: https://github.com/rwightman/pytorch-image-models/blob/main/timm/models/efficientnet.py
+ - Name: efficientnetv2-m_3rdparty_in1k
+ Metadata:
+ FLOPs: 26880363584
+ Parameters: 54139356
+ In Collection: EfficientNetV2
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 85.01
+ Top 5 Accuracy: 97.26
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/efficientnetv2/efficientnetv2-m_3rdparty_in1k_20221220-9dc0c729.pth
+ Config: configs/efficientnet_v2/efficientnetv2-m_8xb32_in1k-480px.py
+ Converted From:
+ Weights: https://github.com/rwightman/pytorch-image-models/releases/download/v0.1-effv2-weights/tf_efficientnetv2_m-cc09e0cd.pth
+ Code: https://github.com/rwightman/pytorch-image-models/blob/main/timm/models/efficientnet.py
+ - Name: efficientnetv2-l_3rdparty_in1k
+ Metadata:
+ FLOPs: 60142387008
+ Parameters: 118515272
+ In Collection: EfficientNetV2
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 85.43
+ Top 5 Accuracy: 97.31
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/efficientnetv2/efficientnetv2-l_3rdparty_in1k_20221220-5c3bac0f.pth
+ Config: configs/efficientnet_v2/efficientnetv2-l_8xb32_in1k-480px.py
+ Converted From:
+ Weights: https://github.com/rwightman/pytorch-image-models/releases/download/v0.1-effv2-weights/tf_efficientnetv2_l-d664b728.pth
+ Code: https://github.com/rwightman/pytorch-image-models/blob/main/timm/models/efficientnet.py
+ - Name: efficientnetv2-s_in21k-pre_3rdparty_in1k
+ Metadata:
+ Training Data:
+ - ImageNet-21k
+ - ImageNet-1k
+ FLOPs: 9719420928
+ Parameters: 21458488
+ In Collection: EfficientNetV2
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 84.29
+ Top 5 Accuracy: 97.26
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/efficientnetv2/efficientnetv2-s_in21k-pre-3rdparty_in1k_20221220-7a7c8475.pth
+ Config: configs/efficientnet_v2/efficientnetv2-s_8xb32_in1k-384px.py
+ Converted From:
+ Weights: https://github.com/rwightman/pytorch-image-models/releases/download/v0.1-effv2-weights/tf_efficientnetv2_s_21ft1k-d7dafa41.pth
+ Code: https://github.com/rwightman/pytorch-image-models/blob/main/timm/models/efficientnet.py
+ - Name: efficientnetv2-m_in21k-pre_3rdparty_in1k
+ Metadata:
+ Training Data:
+ - ImageNet-21k
+ - ImageNet-1k
+ FLOPs: 26880363584
+ Parameters: 54139356
+ In Collection: EfficientNetV2
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 85.47
+ Top 5 Accuracy: 97.76
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/efficientnetv2/efficientnetv2-m_in21k-pre-3rdparty_in1k_20221220-a1013a04.pth
+ Config: configs/efficientnet_v2/efficientnetv2-m_8xb32_in1k-480px.py
+ Converted From:
+ Weights: https://github.com/rwightman/pytorch-image-models/releases/download/v0.1-effv2-weights/tf_efficientnetv2_m_21ft1k-bf41664a.pth
+ Code: https://github.com/rwightman/pytorch-image-models/blob/main/timm/models/efficientnet.py
+ - Name: efficientnetv2-l_in21k-pre_3rdparty_in1k
+ Metadata:
+ Training Data:
+ - ImageNet-21k
+ - ImageNet-1k
+ FLOPs: 60142387008
+ Parameters: 118515272
+ In Collection: EfficientNetV2
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 86.31
+ Top 5 Accuracy: 97.99
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/efficientnetv2/efficientnetv2-l_in21k-pre-3rdparty_in1k_20221220-63df0efd.pth
+ Config: configs/efficientnet_v2/efficientnetv2-l_8xb32_in1k-480px.py
+ Converted From:
+ Weights: https://github.com/rwightman/pytorch-image-models/releases/download/v0.1-effv2-weights/tf_efficientnetv2_l_21ft1k-60127a9d.pth
+ Code: https://github.com/rwightman/pytorch-image-models/blob/main/timm/models/efficientnet.py
+ - Name: efficientnetv2-xl_in21k-pre_3rdparty_in1k
+ Metadata:
+ Training Data:
+ - ImageNet-21k
+ - ImageNet-1k
+ FLOPs: 98341230592
+ Parameters: 208119808
+ In Collection: EfficientNetV2
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 86.39
+ Top 5 Accuracy: 97.83
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/efficientnetv2/efficientnetv2-xl_in21k-pre-3rdparty_in1k_20221220-583ac18b.pth
+ Config: configs/efficientnet_v2/efficientnetv2-xl_8xb32_in1k-512px.py
+ Converted From:
+ Weights: https://github.com/rwightman/pytorch-image-models/releases/download/v0.1-effv2-weights/tf_efficientnetv2_xl_in21ft1k-06c35c48.pth
+ Code: https://github.com/rwightman/pytorch-image-models/blob/main/timm/models/efficientnet.py
+ - Name: efficientnetv2-s_3rdparty_in21k
+ Metadata:
+ FLOPs: 3309720768
+ Parameters: 48158371
+ In Collection: EfficientNetV2
+ Results: null
+ Weights: https://download.openmmlab.com/mmclassification/v0/efficientnetv2/efficientnetv2-s_3rdparty_in21k_20221220-c0572b56.pth
+ Config: configs/efficientnet_v2/efficientnetv2-s_8xb32_in21k.py
+ Converted From:
+ Weights: https://github.com/rwightman/pytorch-image-models/releases/download/v0.1-effv2-weights/tf_efficientnetv2_s_21k-6337ad01.pth
+ Code: https://github.com/rwightman/pytorch-image-models/blob/main/timm/models/efficientnet.py
+ - Name: efficientnetv2-m_3rdparty_in21k
+ Metadata:
+ FLOPs: 5861638208
+ Parameters: 80839239
+ In Collection: EfficientNetV2
+ Results: null
+ Weights: https://download.openmmlab.com/mmclassification/v0/efficientnetv2/efficientnetv2-m_3rdparty_in21k_20221220-073e944c.pth
+ Config: configs/efficientnet_v2/efficientnetv2-m_8xb32_in21k.py
+ Converted From:
+ Weights: https://github.com/rwightman/pytorch-image-models/releases/download/v0.1-effv2-weights/tf_efficientnetv2_m_21k-361418a2.pth
+ Code: https://github.com/rwightman/pytorch-image-models/blob/main/timm/models/efficientnet.py
+ - Name: efficientnetv2-l_3rdparty_in21k
+ Metadata:
+ FLOPs: 13114950464
+ Parameters: 145215155
+ In Collection: EfficientNetV2
+ Results: null
+ Weights: https://download.openmmlab.com/mmclassification/v0/efficientnetv2/efficientnetv2-l_3rdparty_in21k_20221220-f28f91e1.pth
+ Config: configs/efficientnet_v2/efficientnetv2-l_8xb32_in21k.py
+ Converted From:
+ Weights: https://github.com/rwightman/pytorch-image-models/releases/download/v0.1-effv2-weights/tf_efficientnetv2_l_21k-91a19ec9.pth
+ Code: https://github.com/rwightman/pytorch-image-models/blob/main/timm/models/efficientnet.py
+ - Name: efficientnetv2-xl_3rdparty_in21k
+ Metadata:
+ FLOPs: 18855244288
+ Parameters: 234819691
+ In Collection: EfficientNetV2
+ Results: null
+ Weights: https://download.openmmlab.com/mmclassification/v0/efficientnetv2/efficientnetv2-xl_3rdparty_in21k_20221220-b2c9329c.pth
+ Config: configs/efficientnet_v2/efficientnetv2-xl_8xb32_in21k.py
+ Converted From:
+ Weights: https://github.com/rwightman/pytorch-image-models/releases/download/v0.1-effv2-weights/tf_efficientnetv2_xl_in21k-fd7e8abf.pth
+ Code: https://github.com/rwightman/pytorch-image-models/blob/main/timm/models/efficientnet.py
diff --git a/configs/eva/README.md b/configs/eva/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..6e49c8abe8e88bc8eb683dd6dcc0ff06faf86f5f
--- /dev/null
+++ b/configs/eva/README.md
@@ -0,0 +1,101 @@
+# EVA
+
+> [EVA: Exploring the Limits of Masked Visual Representation Learning at Scale](https://arxiv.org/abs/2211.07636)
+
+
+
+## Abstract
+
+We launch EVA, a vision-centric foundation model to explore the limits of visual representation at scale using only publicly accessible data. EVA is a vanilla ViT pre-trained to reconstruct the masked out image-text aligned vision features conditioned on visible image patches. Via this pretext task, we can efficiently scale up EVA to one billion parameters, and sets new records on a broad range of representative vision downstream tasks, such as image recognition, video action recognition, object detection, instance segmentation and semantic segmentation without heavy supervised training. Moreover, we observe quantitative changes in scaling EVA result in qualitative changes in transfer learning performance that are not present in other models. For instance, EVA takes a great leap in the challenging large vocabulary instance segmentation task: our model achieves almost the same state-of-the-art performance on LVISv1.0 dataset with over a thousand categories and COCO dataset with only eighty categories. Beyond a pure vision encoder, EVA can also serve as a vision-centric, multi-modal pivot to connect images and text. We find initializing the vision tower of a giant CLIP from EVA can greatly stabilize the training and outperform the training from scratch counterpart with much fewer samples and less compute, providing a new direction for scaling up and accelerating the costly training of multi-modal foundation models.
+
+
+
+
+
+## How to use it?
+
+
+
+**Predict image**
+
+```python
+from mmpretrain import inference_model
+
+predict = inference_model('vit-base-p16_eva-mae-style-pre_8xb128-coslr-100e_in1k', 'demo/bird.JPEG')
+print(predict['pred_class'])
+print(predict['pred_score'])
+```
+
+**Use the model**
+
+```python
+import torch
+from mmpretrain import get_model
+
+model = get_model('eva-mae-style_vit-base-p16_16xb256-coslr-400e_in1k', pretrained=True)
+inputs = torch.rand(1, 3, 224, 224)
+out = model(inputs)
+print(type(out))
+# To extract features.
+feats = model.extract_feat(inputs)
+print(type(feats))
+```
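+
+The `extract_feat` call above typically returns a tuple with one tensor per output stage. A minimal sketch for inspecting the feature shapes (our illustration, assuming the tuple-of-tensors convention used by most MMPreTrain backbones):
+
+```python
+import torch
+from mmpretrain import get_model
+
+model = get_model('eva-mae-style_vit-base-p16_16xb256-coslr-400e_in1k', pretrained=True)
+model.eval()
+
+with torch.no_grad():
+    feats = model.extract_feat(torch.rand(1, 3, 224, 224))
+
+# Print the shape of each returned feature (fall back to the type if an
+# element is not a tensor).
+for i, feat in enumerate(feats):
+    print(i, getattr(feat, 'shape', type(feat)))
+```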
+
+**Train/Test Command**
+
+Prepare your dataset according to the [docs](https://mmpretrain.readthedocs.io/en/latest/user_guides/dataset_prepare.html#prepare-dataset).
+
+Train:
+
+```shell
+python tools/train.py configs/eva/eva-mae-style_vit-base-p16_16xb256-coslr-400e_in1k.py
+```
+
+Test:
+
+```shell
+python tools/test.py configs/eva/benchmarks/vit-base-p16_8xb128-coslr-100e_in1k.py https://download.openmmlab.com/mmselfsup/1.x/eva/eva-mae-style_vit-base-p16_16xb256-coslr-400e_in1k/vit-base-p16_ft-8xb128-coslr-100e_in1k/vit-base-p16_ft-8xb128-coslr-100e_in1k_20221226-f61cf992.pth
+```
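+
+All configs in this folder are assembled from the `_base_` files they inherit. The sketch below (ours, using MMEngine's `Config` API, which MMPreTrain depends on) shows how to load a config with the bases resolved and tweak a field before launching a job; the printed values come from the fine-tuning config used above:
+
+```python
+from mmengine.config import Config
+
+# `_base_` files are resolved automatically when loading the config.
+cfg = Config.fromfile('configs/eva/benchmarks/vit-base-p16_8xb128-coslr-100e_in1k.py')
+
+print(cfg.model.backbone.arch)          # base
+print(cfg.train_dataloader.batch_size)  # 128
+print(cfg.optim_wrapper.optimizer.lr)   # 0.0004
+
+# Override a field in memory, e.g. to debug with a smaller batch size.
+cfg.train_dataloader.batch_size = 64
+```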
+
+
+
+## Models and results
+
+### Pretrained models
+
+| Model | Params (M) | Flops (G) | Config | Download |
+| :--------------------------------------------------- | :--------: | :-------: | :-------------------------------------------------------------: | :----------------------------------------------------------------: |
+| `eva-mae-style_vit-base-p16_16xb256-coslr-400e_in1k` | 111.78 | 17.58 | [config](eva-mae-style_vit-base-p16_16xb256-coslr-400e_in1k.py) | [model](https://download.openmmlab.com/mmselfsup/1.x/eva/eva-mae-style_vit-base-p16_16xb256-coslr-400e_in1k/eva-mae-style_vit-base-p16_16xb256-coslr-400e_in1k_20221226-26d90f07.pth) \| [log](https://download.openmmlab.com/mmselfsup/1.x/eva/eva-mae-style_vit-base-p16_16xb256-coslr-400e_in1k/eva-mae-style_vit-base-p16_16xb256-coslr-400e_in1k_20221226-26d90f07.json) |
+| `beit-l-p14_3rdparty-eva_in21k`\* | 303.18 | 81.08 | [config](eva-l-p14_headless.py) | [model](https://download.openmmlab.com/mmclassification/v0/eva/eva-l-p14_3rdparty-mim_in21k_20221213-3a5da50b.pth) |
+| `beit-l-p14_eva-pre_3rdparty_in21k`\* | 303.18 | 81.08 | [config](eva-l-p14_headless.py) | [model](https://download.openmmlab.com/mmclassification/v0/eva/eva-l-p14_mim-pre_3rdparty_in21k_20221213-8f194fa2.pth) |
+| `beit-g-p16_3rdparty-eva_30m`\* | 1011.32 | 203.52 | [config](eva-g-p16_headless.py) | [model](https://download.openmmlab.com/mmclassification/v0/eva/eva-g-p16_3rdparty_30m_20221213-7bed23ee.pth) |
+| `beit-g-p14_3rdparty-eva_30m`\* | 1011.60 | 267.17 | [config](eva-g-p14_headless.py) | [model](https://download.openmmlab.com/mmclassification/v0/eva/eva-g-p14_3rdparty_30m_20221213-3b7aca97.pth) |
+| `beit-g-p14_eva-30m-pre_3rdparty_in21k`\* | 1011.60 | 267.17 | [config](eva-g-p14_headless.py) | [model](https://download.openmmlab.com/mmclassification/v0/eva/eva-g-p14_30m-pre_3rdparty_in21k_20221213-d72285b7.pth) |
+
+*Models with * are converted from the [official repo](https://github.com/baaivision/EVA). The config files of these models are only for inference. We haven't reproduced the training results.*
+
+### Image Classification on ImageNet-1k
+
+| Model | Pretrain | Params (M) | Flops (G) | Top-1 (%) | Top-5 (%) | Config | Download |
+| :-------------------------------------- | :----------------------------------------: | :--------: | :-------: | :-------: | :-------: | :--------------------------------------: | :----------------------------------------: |
+| `vit-base-p16_eva-mae-style-pre_8xb128-coslr-100e_in1k` | [EVA MAE STYLE](https://download.openmmlab.com/mmselfsup/1.x/eva/eva-mae-style_vit-base-p16_16xb256-coslr-400e_in1k/eva-mae-style_vit-base-p16_16xb256-coslr-400e_in1k_20221226-26d90f07.pth) | 86.57 | 17.58 | 83.70 | N/A | [config](benchmarks/vit-base-p16_8xb128-coslr-100e_in1k.py) | [model](https://download.openmmlab.com/mmselfsup/1.x/eva/eva-mae-style_vit-base-p16_16xb256-coslr-400e_in1k/vit-base-p16_ft-8xb128-coslr-100e_in1k/vit-base-p16_ft-8xb128-coslr-100e_in1k_20221226-f61cf992.pth) \| [log](https://download.openmmlab.com/mmselfsup/1.x/eva/eva-mae-style_vit-base-p16_16xb256-coslr-400e_in1k/vit-base-p16_ft-8xb128-coslr-100e_in1k/vit-base-p16_ft-8xb128-coslr-100e_in1k_20221226-f61cf992.json) |
+| `vit-base-p16_eva-mae-style-pre_8xb2048-linear-coslr-100e_in1k` | [EVA MAE STYLE](https://download.openmmlab.com/mmselfsup/1.x/eva/eva-mae-style_vit-base-p16_16xb256-coslr-400e_in1k/eva-mae-style_vit-base-p16_16xb256-coslr-400e_in1k_20221226-26d90f07.pth) | 86.57 | 17.58 | 69.00 | N/A | [config](benchmarks/vit-base-p16_8xb2048-linear-coslr-100e_in1k.py) | [model](https://download.openmmlab.com/mmselfsup/1.x/eva/eva-mae-style_vit-base-p16_16xb256-coslr-400e_in1k/vit-base-p16_linear-8xb2048-coslr-100e_in1k/vit-base-p16_linear-8xb2048-coslr-100e_in1k_20221226-ef51bf09.pth) \| [log](https://download.openmmlab.com/mmselfsup/1.x/eva/eva-mae-style_vit-base-p16_16xb256-coslr-400e_in1k/vit-base-p16_linear-8xb2048-coslr-100e_in1k/vit-base-p16_linear-8xb2048-coslr-100e_in1k_20221226-ef51bf09.json) |
+| `beit-l-p14_eva-pre_3rdparty_in1k-196px`\* | [EVA](https://download.openmmlab.com/mmclassification/v0/eva/eva-l-p14_3rdparty-mim_in21k_20221213-3a5da50b.pth) | 304.14 | 61.57 | 87.94 | 98.5 | [config](eva-l-p14_8xb16_in1k-196px.py) | [model](https://download.openmmlab.com/mmclassification/v0/eva/eva-l-p14_mim-pre_3rdparty_in1k-196px_20221214-2adf4d28.pth) |
+| `beit-l-p14_eva-in21k-pre_3rdparty_in1k-196px`\* | EVA ImageNet-21k | 304.14 | 61.57 | 88.58 | 98.65 | [config](eva-l-p14_8xb16_in1k-196px.py) | [model](https://download.openmmlab.com/mmclassification/v0/eva/eva-l-p14_mim-in21k-pre_3rdparty_in1k-196px_20221213-b730c7e7.pth) |
+| `beit-l-p14_eva-pre_3rdparty_in1k-336px`\* | [EVA](https://download.openmmlab.com/mmclassification/v0/eva/eva-l-p14_3rdparty-mim_in21k_20221213-3a5da50b.pth) | 304.53 | 191.10 | 88.66 | 98.75 | [config](eva-l-p14_8xb16_in1k-336px.py) | [model](https://download.openmmlab.com/mmclassification/v0/eva/eva-l-p14_mim-pre_3rdparty_in1k-336px_20221214-07785cfd.pth) |
+| `beit-l-p14_eva-in21k-pre_3rdparty_in1k-336px`\* | EVA ImageNet-21k | 304.53 | 191.10 | 89.17 | 98.86 | [config](eva-l-p14_8xb16_in1k-336px.py) | [model](https://download.openmmlab.com/mmclassification/v0/eva/eva-l-p14_mim-in21k-pre_3rdparty_in1k-336px_20221213-f25b7634.pth) |
+| `beit-g-p14_eva-30m-in21k-pre_3rdparty_in1k-336px`\* | [EVA 30M ImageNet-21k](https://download.openmmlab.com/mmclassification/v0/eva/eva-g-p14_30m-pre_3rdparty_in21k_20221213-d72285b7.pth) | 1013.01 | 620.64 | 89.61 | 98.93 | [config](eva-g-p14_8xb16_in1k-336px.py) | [model](https://download.openmmlab.com/mmclassification/v0/eva/eva-g-p14_30m-in21k-pre_3rdparty_in1k-336px_20221213-210f9071.pth) |
+| `beit-g-p14_eva-30m-in21k-pre_3rdparty_in1k-560px`\* | [EVA 30M ImageNet-21k](https://download.openmmlab.com/mmclassification/v0/eva/eva-g-p14_30m-pre_3rdparty_in21k_20221213-d72285b7.pth) | 1014.45 | 1906.76 | 89.71 | 98.96 | [config](eva-g-p14_8xb16_in1k-560px.py) | [model](https://download.openmmlab.com/mmclassification/v0/eva/eva-g-p14_30m-in21k-pre_3rdparty_in1k-560px_20221213-fa1c3652.pth) |
+
+*Models with * are converted from the [official repo](https://github.com/baaivision/EVA). The config files of these models are only for inference. We haven't reproduced the training results.*
+
+## Citation
+
+```bibtex
+@article{EVA,
+ title={EVA: Exploring the Limits of Masked Visual Representation Learning at Scale},
+ author={Fang, Yuxin and Wang, Wen and Xie, Binhui and Sun, Quan and Wu, Ledell and Wang, Xinggang and Huang, Tiejun and Wang, Xinlong and Cao, Yue},
+ journal={arXiv preprint arXiv:2211.07636},
+ year={2022}
+}
+```
diff --git a/configs/eva/benchmarks/vit-base-p16_8xb128-coslr-100e_in1k.py b/configs/eva/benchmarks/vit-base-p16_8xb128-coslr-100e_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..e8a3f4983ac19208090ee63e9c9160b945b22ee6
--- /dev/null
+++ b/configs/eva/benchmarks/vit-base-p16_8xb128-coslr-100e_in1k.py
@@ -0,0 +1,114 @@
+_base_ = [
+ '../../_base_/datasets/imagenet_bs64_swin_224.py',
+ '../../_base_/schedules/imagenet_bs1024_adamw_swin.py',
+ '../../_base_/default_runtime.py'
+]
+
+# dataset settings
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='RandomResizedCrop',
+ scale=224,
+ backend='pillow',
+ interpolation='bicubic'),
+ dict(type='RandomFlip', prob=0.5, direction='horizontal'),
+ dict(
+ type='RandAugment',
+ policies='timm_increasing',
+ num_policies=2,
+ total_level=10,
+ magnitude_level=9,
+ magnitude_std=0.5,
+ hparams=dict(pad_val=[104, 116, 124], interpolation='bicubic')),
+ dict(
+ type='RandomErasing',
+ erase_prob=0.25,
+ mode='rand',
+ min_area_ratio=0.02,
+ max_area_ratio=0.3333333333333333,
+ fill_color=[103.53, 116.28, 123.675],
+ fill_std=[57.375, 57.12, 58.395]),
+ dict(type='PackInputs')
+]
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='ResizeEdge',
+ scale=256,
+ edge='short',
+ backend='pillow',
+ interpolation='bicubic'),
+ dict(type='CenterCrop', crop_size=224),
+ dict(type='PackInputs')
+]
+
+train_dataloader = dict(batch_size=128, dataset=dict(pipeline=train_pipeline))
+val_dataloader = dict(batch_size=128, dataset=dict(pipeline=test_pipeline))
+test_dataloader = val_dataloader
+
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='VisionTransformer',
+ arch='base',
+ img_size=224,
+ patch_size=16,
+ drop_path_rate=0.1,
+ out_type='avg_featmap',
+ final_norm=False,
+ init_cfg=dict(type='Pretrained', checkpoint='', prefix='backbone.')),
+ neck=None,
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=768,
+ loss=dict(
+ type='LabelSmoothLoss', label_smooth_val=0.1, mode='original'),
+ init_cfg=[dict(type='TruncNormal', layer='Linear', std=0.02)]),
+ train_cfg=dict(augments=[
+ dict(type='Mixup', alpha=0.8),
+ dict(type='CutMix', alpha=1.0)
+ ]))
+
+# optimizer wrapper
+optim_wrapper = dict(
+ optimizer=dict(
+ type='AdamW', lr=4e-4, weight_decay=0.05, betas=(0.9, 0.999)),
+ constructor='LearningRateDecayOptimWrapperConstructor',
+ paramwise_cfg=dict(
+ layer_decay_rate=0.65,
+ custom_keys={
+ '.ln': dict(decay_mult=0.0),
+ '.bias': dict(decay_mult=0.0),
+ '.cls_token': dict(decay_mult=0.0),
+ '.pos_embed': dict(decay_mult=0.0)
+ }))
+
+# learning rate scheduler
+param_scheduler = [
+ dict(
+ type='LinearLR',
+ start_factor=1e-4,
+ by_epoch=True,
+ begin=0,
+ end=5,
+ convert_to_iter_based=True),
+ dict(
+ type='CosineAnnealingLR',
+ T_max=95,
+ by_epoch=True,
+ begin=5,
+ end=100,
+ eta_min=1e-6,
+ convert_to_iter_based=True)
+]
+
+# runtime settings
+train_cfg = dict(by_epoch=True, max_epochs=100)
+default_hooks = dict(
+ # save checkpoint per epoch.
+ checkpoint=dict(type='CheckpointHook', interval=1, max_keep_ckpts=3))
+
+randomness = dict(seed=0, diff_rank_seed=True)
diff --git a/configs/eva/benchmarks/vit-base-p16_8xb2048-linear-coslr-100e_in1k.py b/configs/eva/benchmarks/vit-base-p16_8xb2048-linear-coslr-100e_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..0b7333ca475ad1d9607ddda898acb623e1bd7aa4
--- /dev/null
+++ b/configs/eva/benchmarks/vit-base-p16_8xb2048-linear-coslr-100e_in1k.py
@@ -0,0 +1,70 @@
+_base_ = [
+ '../../_base_/datasets/imagenet_bs32_pil_resize.py',
+ '../../_base_/schedules/imagenet_bs1024_adamw_swin.py',
+ '../../_base_/default_runtime.py'
+]
+
+train_dataloader = dict(batch_size=2048, drop_last=True)
+val_dataloader = dict(drop_last=False)
+test_dataloader = dict(drop_last=False)
+
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='VisionTransformer',
+ arch='base',
+ img_size=224,
+ patch_size=16,
+ frozen_stages=12,
+ out_type='cls_token',
+ final_norm=True,
+ init_cfg=dict(type='Pretrained', checkpoint='', prefix='backbone.')),
+ neck=dict(type='ClsBatchNormNeck', input_features=768),
+ head=dict(
+ type='VisionTransformerClsHead',
+ num_classes=1000,
+ in_channels=768,
+ loss=dict(type='CrossEntropyLoss'),
+ init_cfg=[dict(type='TruncNormal', layer='Linear', std=0.01)]),
+ data_preprocessor=dict(
+ num_classes=1000,
+ mean=[123.675, 116.28, 103.53],
+ std=[58.395, 57.12, 57.375],
+ to_rgb=True,
+ ))
+
+# optimizer
+optim_wrapper = dict(
+ _delete_=True,
+ type='AmpOptimWrapper',
+ optimizer=dict(type='LARS', lr=3.2, weight_decay=0.0, momentum=0.9),
+)
+
+# learning rate scheduler
+param_scheduler = [
+ dict(
+ type='LinearLR',
+ start_factor=1e-4,
+ by_epoch=True,
+ begin=0,
+ end=10,
+ convert_to_iter_based=True),
+ dict(
+ type='CosineAnnealingLR',
+ T_max=90,
+ by_epoch=True,
+ begin=10,
+ end=100,
+ eta_min=0.0,
+ convert_to_iter_based=True)
+]
+
+# runtime settings
+train_cfg = dict(by_epoch=True, max_epochs=100)
+
+default_hooks = dict(
+ checkpoint=dict(type='CheckpointHook', interval=1, max_keep_ckpts=3),
+ logger=dict(type='LoggerHook', interval=10))
+
+randomness = dict(seed=0, diff_rank_seed=True)
diff --git a/configs/eva/eva-g-p14_8xb16_in1k-336px.py b/configs/eva/eva-g-p14_8xb16_in1k-336px.py
new file mode 100644
index 0000000000000000000000000000000000000000..aa2bd7ee5be0167c5d69d5f1cc96a069e5f17cb5
--- /dev/null
+++ b/configs/eva/eva-g-p14_8xb16_in1k-336px.py
@@ -0,0 +1,9 @@
+_base_ = [
+ '../_base_/models/eva/eva-g.py',
+ '../_base_/datasets/imagenet_bs16_eva_336.py',
+ '../_base_/schedules/imagenet_bs1024_adamw_swin.py',
+ '../_base_/default_runtime.py'
+]
+
+# model settings
+model = dict(backbone=dict(img_size=336))
diff --git a/configs/eva/eva-g-p14_8xb16_in1k-560px.py b/configs/eva/eva-g-p14_8xb16_in1k-560px.py
new file mode 100644
index 0000000000000000000000000000000000000000..ed20866b7f0dc19b919a06a71e50a205370194a0
--- /dev/null
+++ b/configs/eva/eva-g-p14_8xb16_in1k-560px.py
@@ -0,0 +1,9 @@
+_base_ = [
+ '../_base_/models/eva/eva-g.py',
+ '../_base_/datasets/imagenet_bs16_eva_560.py',
+ '../_base_/schedules/imagenet_bs1024_adamw_swin.py',
+ '../_base_/default_runtime.py'
+]
+
+# model settings
+model = dict(backbone=dict(img_size=560))
diff --git a/configs/eva/eva-g-p14_headless.py b/configs/eva/eva-g-p14_headless.py
new file mode 100644
index 0000000000000000000000000000000000000000..b278aceab6211c55702c69beb1b396f37064a8b9
--- /dev/null
+++ b/configs/eva/eva-g-p14_headless.py
@@ -0,0 +1,24 @@
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='BEiTViT',
+ arch='eva-g',
+ img_size=224,
+ patch_size=14,
+ layer_scale_init_value=0.0,
+ out_type='avg_featmap',
+ use_abs_pos_emb=True,
+ use_rel_pos_bias=False,
+ use_shared_rel_pos_bias=False,
+ ),
+ neck=None,
+ head=None,
+)
+
+data_preprocessor = dict(
+ # RGB format normalization parameters
+ mean=[0.48145466 * 255, 0.4578275 * 255, 0.40821073 * 255],
+ std=[0.26862954 * 255, 0.26130258 * 255, 0.27577711 * 255],
+ # convert image from BGR to RGB
+ to_rgb=True,
+)
diff --git a/configs/eva/eva-g-p16_headless.py b/configs/eva/eva-g-p16_headless.py
new file mode 100644
index 0000000000000000000000000000000000000000..ca5de1860f5edb0ee768eb12ce7c528fa17e2a00
--- /dev/null
+++ b/configs/eva/eva-g-p16_headless.py
@@ -0,0 +1,24 @@
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='BEiTViT',
+ arch='eva-g',
+ img_size=224,
+ patch_size=16,
+ layer_scale_init_value=0.0,
+ out_type='avg_featmap',
+ use_abs_pos_emb=True,
+ use_rel_pos_bias=False,
+ use_shared_rel_pos_bias=False,
+ ),
+ neck=None,
+ head=None,
+)
+
+data_preprocessor = dict(
+ # RGB format normalization parameters
+ mean=[0.48145466 * 255, 0.4578275 * 255, 0.40821073 * 255],
+ std=[0.26862954 * 255, 0.26130258 * 255, 0.27577711 * 255],
+ # convert image from BGR to RGB
+ to_rgb=True,
+)
diff --git a/configs/eva/eva-l-p14_8xb16_in1k-196px.py b/configs/eva/eva-l-p14_8xb16_in1k-196px.py
new file mode 100644
index 0000000000000000000000000000000000000000..3503ca5d78022e29f1c1c945aa1226085f1c3eb6
--- /dev/null
+++ b/configs/eva/eva-l-p14_8xb16_in1k-196px.py
@@ -0,0 +1,9 @@
+_base_ = [
+ '../_base_/models/eva/eva-l.py',
+ '../_base_/datasets/imagenet_bs16_eva_196.py',
+ '../_base_/schedules/imagenet_bs1024_adamw_swin.py',
+ '../_base_/default_runtime.py'
+]
+
+# model settings
+model = dict(backbone=dict(img_size=196))
diff --git a/configs/eva/eva-l-p14_8xb16_in1k-336px.py b/configs/eva/eva-l-p14_8xb16_in1k-336px.py
new file mode 100644
index 0000000000000000000000000000000000000000..7094df8ba3de0540049eaeb4693ef5b09094dc2b
--- /dev/null
+++ b/configs/eva/eva-l-p14_8xb16_in1k-336px.py
@@ -0,0 +1,9 @@
+_base_ = [
+ '../_base_/models/eva/eva-l.py',
+ '../_base_/datasets/imagenet_bs16_eva_336.py',
+ '../_base_/schedules/imagenet_bs1024_adamw_swin.py',
+ '../_base_/default_runtime.py'
+]
+
+# model settings
+model = dict(backbone=dict(img_size=336))
diff --git a/configs/eva/eva-l-p14_headless.py b/configs/eva/eva-l-p14_headless.py
new file mode 100644
index 0000000000000000000000000000000000000000..89a4ce10990489daf92e95c1355669f242838ff3
--- /dev/null
+++ b/configs/eva/eva-l-p14_headless.py
@@ -0,0 +1,25 @@
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='BEiTViT',
+ arch='l',
+ img_size=224,
+ patch_size=14,
+ layer_scale_init_value=0.0,
+ out_type='avg_featmap',
+ use_abs_pos_emb=True,
+ use_rel_pos_bias=False,
+ use_shared_rel_pos_bias=False,
+ layer_cfgs=dict(bias=True),
+ ),
+ neck=None,
+ head=None,
+)
+
+data_preprocessor = dict(
+ # RGB format normalization parameters
+ mean=[0.48145466 * 255, 0.4578275 * 255, 0.40821073 * 255],
+ std=[0.26862954 * 255, 0.26130258 * 255, 0.27577711 * 255],
+ # convert image from BGR to RGB
+ to_rgb=True,
+)
diff --git a/configs/eva/eva-mae-style_vit-base-p16_16xb256-coslr-400e_in1k.py b/configs/eva/eva-mae-style_vit-base-p16_16xb256-coslr-400e_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..bbedb07c727aaa38c2de9f57fa6cfe9fdbdd87a2
--- /dev/null
+++ b/configs/eva/eva-mae-style_vit-base-p16_16xb256-coslr-400e_in1k.py
@@ -0,0 +1,86 @@
+_base_ = [
+ '../_base_/models/mae_vit-base-p16.py',
+ '../_base_/datasets/imagenet_bs512_mae.py',
+ '../_base_/default_runtime.py',
+]
+
+# dataset settings
+train_dataloader = dict(batch_size=256)
+
+# model settings
+model = dict(
+ type='EVA',
+ backbone=dict(init_cfg=[
+ dict(type='Xavier', distribution='uniform', layer='Linear'),
+ dict(type='Constant', layer='LayerNorm', val=1.0, bias=0.0)
+ ]),
+ neck=dict(
+ type='MAEPretrainDecoder',
+ predict_feature_dim=512,
+ init_cfg=[
+ dict(type='Xavier', distribution='uniform', layer='Linear'),
+ dict(type='Constant', layer='LayerNorm', val=1.0, bias=0.0)
+ ]),
+ head=dict(
+ _delete_=True,
+ type='MIMHead',
+ loss=dict(
+ type='CosineSimilarityLoss', shift_factor=2.0, scale_factor=2.0),
+ ),
+ target_generator=dict(
+ type='CLIPGenerator',
+ tokenizer_path= # noqa
+ 'https://download.openmmlab.com/mmselfsup/1.x/target_generator_ckpt/clip_vit_base_16.pth.tar' # noqa
+ ),
+ init_cfg=None)
+
+# optimizer wrapper
+optim_wrapper = dict(
+ type='OptimWrapper',
+ optimizer=dict(
+ type='AdamW',
+ lr=1.5e-4 * 4096 / 256,
+ betas=(0.9, 0.95),
+ weight_decay=0.05),
+ paramwise_cfg=dict(
+ custom_keys={
+ 'ln': dict(decay_mult=0.0),
+ 'bias': dict(decay_mult=0.0),
+ 'pos_embed': dict(decay_mult=0.),
+ 'mask_token': dict(decay_mult=0.),
+ 'cls_token': dict(decay_mult=0.)
+ }))
+find_unused_parameters = True
+
+# learning rate scheduler
+param_scheduler = [
+ dict(
+ type='LinearLR',
+ start_factor=1e-4,
+ by_epoch=True,
+ begin=0,
+ end=40,
+ convert_to_iter_based=True),
+ dict(
+ type='CosineAnnealingLR',
+ T_max=360,
+ by_epoch=True,
+ begin=40,
+ end=400,
+ convert_to_iter_based=True)
+]
+
+# runtime settings
+train_cfg = dict(type='EpochBasedTrainLoop', max_epochs=400)
+default_hooks = dict(
+ # only keeps the latest 3 checkpoints
+ checkpoint=dict(type='CheckpointHook', interval=1, max_keep_ckpts=3))
+
+randomness = dict(seed=0, diff_rank_seed=True)
+
+# auto resume
+resume = True
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR
+# based on the actual training batch size.
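+# When auto scaling is enabled, the learning rate defined above is multiplied
+# by (real total batch size) / base_batch_size, so other GPU or per-GPU
+# batch-size settings keep roughly the same effective schedule.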
+auto_scale_lr = dict(base_batch_size=4096)
diff --git a/configs/eva/metafile.yml b/configs/eva/metafile.yml
new file mode 100644
index 0000000000000000000000000000000000000000..dd8dbbf761486532d228bbf3df5ef396b92d4880
--- /dev/null
+++ b/configs/eva/metafile.yml
@@ -0,0 +1,261 @@
+Collections:
+ - Name: EVA
+ Metadata:
+ Architecture:
+ - Attention Dropout
+ - Convolution
+ - Dense Connections
+ - Dropout
+ - GELU
+ - Layer Normalization
+ - Multi-Head Attention
+ - Scaled Dot-Product Attention
+ - Tanh Activation
+ Paper:
+ Title: 'EVA: Exploring the Limits of Masked Visual Representation Learning at
+ Scale'
+ URL: https://arxiv.org/abs/2211.07636
+ README: configs/eva/README.md
+ Code:
+ URL: null
+ Version: null
+
+Models:
+ - Name: eva-mae-style_vit-base-p16_16xb256-coslr-400e_in1k
+ Metadata:
+ Epochs: 400
+ Batch Size: 4096
+ FLOPs: 17581972224
+ Parameters: 111776512
+ Training Data: ImageNet-1k
+ In Collection: EVA
+ Results: null
+ Weights: https://download.openmmlab.com/mmselfsup/1.x/eva/eva-mae-style_vit-base-p16_16xb256-coslr-400e_in1k/eva-mae-style_vit-base-p16_16xb256-coslr-400e_in1k_20221226-26d90f07.pth
+ Config: configs/eva/eva-mae-style_vit-base-p16_16xb256-coslr-400e_in1k.py
+ Downstream:
+ - vit-base-p16_eva-mae-style-pre_8xb128-coslr-100e_in1k
+ - vit-base-p16_eva-mae-style-pre_8xb2048-linear-coslr-100e_in1k
+ - Name: vit-base-p16_eva-mae-style-pre_8xb128-coslr-100e_in1k
+ Metadata:
+ Epochs: 100
+ Batch Size: 1024
+ FLOPs: 17581215744
+ Parameters: 86566120
+ Training Data: ImageNet-1k
+ In Collection: EVA
+ Results:
+ - Task: Image Classification
+ Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 83.7
+ Weights: https://download.openmmlab.com/mmselfsup/1.x/eva/eva-mae-style_vit-base-p16_16xb256-coslr-400e_in1k/vit-base-p16_ft-8xb128-coslr-100e_in1k/vit-base-p16_ft-8xb128-coslr-100e_in1k_20221226-f61cf992.pth
+ Config: configs/eva/benchmarks/vit-base-p16_8xb128-coslr-100e_in1k.py
+ - Name: vit-base-p16_eva-mae-style-pre_8xb2048-linear-coslr-100e_in1k
+ Metadata:
+ Epochs: 100
+ Batch Size: 16384
+ FLOPs: 17581972992
+ Parameters: 86567656
+ Training Data: ImageNet-1k
+ In Collection: EVA
+ Results:
+ - Task: Image Classification
+ Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 69.0
+ Weights: https://download.openmmlab.com/mmselfsup/1.x/eva/eva-mae-style_vit-base-p16_16xb256-coslr-400e_in1k/vit-base-p16_linear-8xb2048-coslr-100e_in1k/vit-base-p16_linear-8xb2048-coslr-100e_in1k_20221226-ef51bf09.pth
+ Config: configs/eva/benchmarks/vit-base-p16_8xb2048-linear-coslr-100e_in1k.py
+ - Name: beit-l-p14_eva-pre_3rdparty_in1k-196px
+ Metadata:
+ FLOPs: 61565981696
+ Parameters: 304142312
+ Training Data:
+ - ImageNet-21k
+ - ImageNet-1k
+ In Collection: EVA
+ Results:
+ - Dataset: ImageNet-1k
+ Task: Image Classification
+ Metrics:
+ Top 1 Accuracy: 87.94
+ Top 5 Accuracy: 98.5
+ Weights: https://download.openmmlab.com/mmclassification/v0/eva/eva-l-p14_mim-pre_3rdparty_in1k-196px_20221214-2adf4d28.pth
+ Config: configs/eva/eva-l-p14_8xb16_in1k-196px.py
+ Converted From:
+ Weights: https://huggingface.co/BAAI/EVA/blob/main/eva_l_psz14_196px_1k_ft_88p0.pt
+ Code: https://github.com/baaivision/EVA
+ - Name: beit-l-p14_eva-in21k-pre_3rdparty_in1k-196px
+ Metadata:
+ FLOPs: 61565981696
+ Parameters: 304142312
+ Training Data:
+ - ImageNet-21k
+ - ImageNet-1k
+ In Collection: EVA
+ Results:
+ - Dataset: ImageNet-1k
+ Task: Image Classification
+ Metrics:
+ Top 1 Accuracy: 88.58
+ Top 5 Accuracy: 98.65
+ Weights: https://download.openmmlab.com/mmclassification/v0/eva/eva-l-p14_mim-in21k-pre_3rdparty_in1k-196px_20221213-b730c7e7.pth
+ Config: configs/eva/eva-l-p14_8xb16_in1k-196px.py
+ Converted From:
+ Weights: https://huggingface.co/BAAI/EVA/blob/main/eva_l_psz14_196px_21k_to_1k_ft_88p6.pt
+ Code: https://github.com/baaivision/EVA
+ - Name: beit-l-p14_3rdparty-eva_in21k
+ Metadata:
+ FLOPs: 81075147776
+ Parameters: 303178752
+ Training Data:
+ - ImageNet-21k
+ In Collection: EVA
+ Results: null
+ Weights: https://download.openmmlab.com/mmclassification/v0/eva/eva-l-p14_3rdparty-mim_in21k_20221213-3a5da50b.pth
+ Config: configs/eva/eva-l-p14_headless.py
+ Converted From:
+ Weights: https://huggingface.co/BAAI/EVA/blob/main/eva_l_psz14.pt
+ Code: https://github.com/baaivision/EVA
+ Downstream:
+ - beit-l-p14_eva-pre_3rdparty_in21k
+ - beit-l-p14_eva-pre_3rdparty_in1k-336px
+ - beit-l-p14_eva-pre_3rdparty_in1k-196px
+ - Name: beit-l-p14_eva-pre_3rdparty_in21k
+ Metadata:
+ FLOPs: 81075147776
+ Parameters: 303178752
+ Training Data:
+ - ImageNet-21k
+ In Collection: EVA
+ Results: null
+ Weights: https://download.openmmlab.com/mmclassification/v0/eva/eva-l-p14_mim-pre_3rdparty_in21k_20221213-8f194fa2.pth
+ Config: configs/eva/eva-l-p14_headless.py
+ Converted From:
+ Weights: https://huggingface.co/BAAI/EVA/blob/main/eva_l_psz14_21k_ft.pt
+ Code: https://github.com/baaivision/EVA
+ - Name: beit-l-p14_eva-pre_3rdparty_in1k-336px
+ Metadata:
+ FLOPs: 191100916736
+ Parameters: 304531432
+ Training Data:
+ - ImageNet-21k
+ - ImageNet-1k
+ In Collection: EVA
+ Results:
+ - Dataset: ImageNet-1k
+ Task: Image Classification
+ Metrics:
+ Top 1 Accuracy: 88.66
+ Top 5 Accuracy: 98.75
+ Weights: https://download.openmmlab.com/mmclassification/v0/eva/eva-l-p14_mim-pre_3rdparty_in1k-336px_20221214-07785cfd.pth
+ Config: configs/eva/eva-l-p14_8xb16_in1k-336px.py
+ Converted From:
+ Weights: https://huggingface.co/BAAI/EVA/blob/main/eva_l_psz14_336px_1k_ft_88p65.pt
+ Code: https://github.com/baaivision/EVA
+ Downstream:
+ - beit-l-p14_eva-in21k-pre_3rdparty_in1k-336px
+ - beit-l-p14_eva-in21k-pre_3rdparty_in1k-196px
+ - Name: beit-l-p14_eva-in21k-pre_3rdparty_in1k-336px
+ Metadata:
+ FLOPs: 191100916736
+ Parameters: 304531432
+ Training Data:
+ - ImageNet-21k
+ - ImageNet-1k
+ In Collection: EVA
+ Results:
+ - Dataset: ImageNet-1k
+ Task: Image Classification
+ Metrics:
+ Top 1 Accuracy: 89.17
+ Top 5 Accuracy: 98.86
+ Weights: https://download.openmmlab.com/mmclassification/v0/eva/eva-l-p14_mim-in21k-pre_3rdparty_in1k-336px_20221213-f25b7634.pth
+ Config: configs/eva/eva-l-p14_8xb16_in1k-336px.py
+ Converted From:
+ Weights: https://huggingface.co/BAAI/EVA/blob/main/eva_l_psz14_336px_21k_to_1k_ft_89p2.pt
+ Code: https://github.com/baaivision/EVA
+ - Name: beit-g-p16_3rdparty-eva_30m
+ Metadata:
+ FLOPs: 203517463424
+ Parameters: 1011315072
+ Training Data:
+ - merged-30M
+ In Collection: EVA
+ Results: null
+ Weights: https://download.openmmlab.com/mmclassification/v0/eva/eva-g-p16_3rdparty_30m_20221213-7bed23ee.pth
+ Config: configs/eva/eva-g-p16_headless.py
+ Converted From:
+ Weights: https://huggingface.co/BAAI/EVA/blob/main/eva_psz14to16.pt
+ Code: https://github.com/baaivision/EVA
+ - Name: beit-g-p14_3rdparty-eva_30m
+ Metadata:
+ FLOPs: 267174833024
+ Parameters: 1011596672
+ Training Data:
+ - merged-30M
+ In Collection: EVA
+ Results: null
+ Weights: https://download.openmmlab.com/mmclassification/v0/eva/eva-g-p14_3rdparty_30m_20221213-3b7aca97.pth
+ Config: configs/eva/eva-g-p14_headless.py
+ Converted From:
+ Weights: https://huggingface.co/BAAI/EVA/blob/main/eva_psz14.pt
+ Code: https://github.com/baaivision/EVA
+ Downstream:
+ - beit-g-p14_eva-30m-pre_3rdparty_in21k
+ - Name: beit-g-p14_eva-30m-pre_3rdparty_in21k
+ Metadata:
+ FLOPs: 267174833024
+ Parameters: 1011596672
+ Training Data:
+ - merged-30M
+ - ImageNet-21k
+ In Collection: EVA
+ Results: null
+ Weights: https://download.openmmlab.com/mmclassification/v0/eva/eva-g-p14_30m-pre_3rdparty_in21k_20221213-d72285b7.pth
+ Config: configs/eva/eva-g-p14_headless.py
+ Converted From:
+ Weights: https://huggingface.co/BAAI/EVA/blob/main/eva_21k_224px_psz14.pt
+ Code: https://github.com/baaivision/EVA
+ Downstream:
+ - beit-g-p14_eva-30m-in21k-pre_3rdparty_in1k-336px
+ - beit-g-p14_eva-30m-in21k-pre_3rdparty_in1k-560px
+ - Name: beit-g-p14_eva-30m-in21k-pre_3rdparty_in1k-336px
+ Metadata:
+ FLOPs: 620642757504
+ Parameters: 1013005672
+ Training Data:
+ - merged-30M
+ - ImageNet-21k
+ - ImageNet-1k
+ In Collection: EVA
+ Results:
+ - Dataset: ImageNet-1k
+ Task: Image Classification
+ Metrics:
+ Top 1 Accuracy: 89.61
+ Top 5 Accuracy: 98.93
+ Weights: https://download.openmmlab.com/mmclassification/v0/eva/eva-g-p14_30m-in21k-pre_3rdparty_in1k-336px_20221213-210f9071.pth
+ Config: configs/eva/eva-g-p14_8xb16_in1k-336px.py
+ Converted From:
+ Weights: https://huggingface.co/BAAI/EVA/blob/main/eva_21k_1k_336px_psz14_ema_89p6.pt
+ Code: https://github.com/baaivision/EVA
+ - Name: beit-g-p14_eva-30m-in21k-pre_3rdparty_in1k-560px
+ Metadata:
+ FLOPs: 1906761591680
+ Parameters: 1014447464
+ Training Data:
+ - merged-30M
+ - ImageNet-21k
+ - ImageNet-1k
+ In Collection: EVA
+ Results:
+ - Dataset: ImageNet-1k
+ Task: Image Classification
+ Metrics:
+ Top 1 Accuracy: 89.71
+ Top 5 Accuracy: 98.96
+ Weights: https://download.openmmlab.com/mmclassification/v0/eva/eva-g-p14_30m-in21k-pre_3rdparty_in1k-560px_20221213-fa1c3652.pth
+ Config: configs/eva/eva-g-p14_8xb16_in1k-560px.py
+ Converted From:
+ Weights: https://huggingface.co/BAAI/EVA/blob/main/eva_21k_1k_560px_psz14_ema_89p7.pt
+ Code: https://github.com/baaivision/EVA
diff --git a/configs/eva02/README.md b/configs/eva02/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..bc8f64e76d1601ade6ef052a2f23f7d2f6123843
--- /dev/null
+++ b/configs/eva02/README.md
@@ -0,0 +1,109 @@
+# EVA-02
+
+> [EVA-02: A Visual Representation for Neon Genesis](https://arxiv.org/abs/2303.11331)
+
+
+
+## Abstract
+
+We launch EVA-02, a next-generation Transformer-based visual representation pre-trained to reconstruct strong and robust language-aligned vision features via masked image modeling. With an updated plain Transformer architecture as well as extensive pre-training from an open & accessible giant CLIP vision encoder, EVA-02 demonstrates superior performance compared to prior state-of-the-art approaches across various representative vision tasks, while utilizing significantly fewer parameters and compute budgets. Notably, using exclusively publicly accessible training data, EVA-02 with only 304M parameters achieves a phenomenal 90.0 fine-tuning top-1 accuracy on ImageNet-1K val set. Additionally, our EVA-02-CLIP can reach up to 80.4 zero-shot top-1 on ImageNet-1K, outperforming the previous largest & best open-sourced CLIP with only ~1/6 parameters and ~1/6 image-text training data. We offer four EVA-02 variants in various model sizes, ranging from 6M to 304M parameters, all with impressive performance. To facilitate open access and open research, we release the complete suite of EVA-02 to the community.
+
+
+
+
+
+## How to use it?
+
+
+
+**Predict image**
+
+```python
+from mmpretrain import inference_model
+
+predict = inference_model('vit-tiny-p14_eva02-in21k-pre_3rdparty_in1k-336px', 'demo/bird.JPEG')
+print(predict['pred_class'])
+print(predict['pred_score'])
+```
+
+**Use the model**
+
+```python
+import torch
+from mmpretrain import get_model
+
+model = get_model('vit-tiny-p14_eva02-in21k-pre_3rdparty_in1k-336px', pretrained=True)
+inputs = torch.rand(1, 3, 336, 336)
+out = model(inputs)
+print(type(out))
+# To extract features.
+feats = model.extract_feat(inputs)
+print(type(feats))
+```
+
+**Train/Test Command**
+
+Prepare your dataset according to the [docs](https://mmpretrain.readthedocs.io/en/latest/user_guides/dataset_prepare.html#prepare-dataset).
+
+Train:
+
+```shell
+python tools/train.py configs/eva02/eva02-tiny-p14_in1k.py
+```
+
+Test:
+
+```shell
+python tools/test.py configs/eva02/eva02-tiny-p14_in1k.py /path/to/eva02-tiny-p14_in1k.pth
+```
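+
+For batched prediction over many images, the `ImageClassificationInferencer` helper can be used instead of `inference_model`. A minimal sketch (ours; it assumes the inferencer API described in the MMPreTrain inference user guide, and the image paths are placeholders):
+
+```python
+from mmpretrain import ImageClassificationInferencer
+
+# Build an inferencer from a model name in the tables below; the weights are
+# downloaded automatically when `pretrained=True`.
+inferencer = ImageClassificationInferencer(
+    'vit-tiny-p14_eva02-in21k-pre_3rdparty_in1k-336px', pretrained=True)
+
+# Replace these placeholder paths with your own images.
+results = inferencer(['path/to/img1.jpg', 'path/to/img2.jpg'], batch_size=2)
+for res in results:
+    print(res['pred_class'], res['pred_score'])
+```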
+
+
+
+## Models and results
+
+### Pretrained models
+
+| Model | Params (M) | Flops (G) | Config | Download |
+| :-------------------------------- | :--------: | :-------: | :-----------------------------------: | :-----------------------------------------------------------------------------------------------------------: |
+| `vit-tiny-p14_eva02-pre_in21k`\* | 5.50 | 1.70 | [config](eva02-tiny-p14_headless.py) | [model](https://download.openmmlab.com/mmpretrain/v1.0/eva02/eva02-tiny-p14_pre_in21k_20230505-d703e7b1.pth) |
+| `vit-small-p14_eva02-pre_in21k`\* | 21.62 | 6.14 | [config](eva02-small-p14_headless.py) | [model](https://download.openmmlab.com/mmpretrain/v1.0/eva02/eva02-small-p14_pre_in21k_20230505-3175f463.pth) |
+| `vit-base-p14_eva02-pre_in21k`\* | 85.77 | 23.22 | [config](eva02-base-p14_headless.py) | [model](https://download.openmmlab.com/mmpretrain/v1.0/eva02/eva02-base-p14_pre_in21k_20230505-2f2d4d3c.pth) |
+| `vit-large-p14_eva02-pre_in21k`\* | 303.29 | 81.15 | [config](eva02-large-p14_headless.py) | [model](https://download.openmmlab.com/mmpretrain/v1.0/eva02/eva02-large-p14_pre_in21k_20230505-9072de5d.pth) |
+| `vit-large-p14_eva02-pre_m38m`\* | 303.29 | 81.15 | [config](eva02-large-p14_headless.py) | [model](https://download.openmmlab.com/mmpretrain/v1.0/eva02/eva02-large-p14_pre_m38m_20230505-b8a1a261.pth) |
+
+- The input size / patch size of MIM pre-trained EVA-02 is `224x224` / `14x14`.
+
+*Models with * are converted from the [official repo](https://github.com/baaivision/EVA).*
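+
+As noted above, MIM pre-training uses `224x224` inputs with `14x14` patches, while the fine-tuned checkpoints below use 336px or 448px inputs. A quick back-of-the-envelope check (ours) of how the patch-token count, and hence the compute, grows with resolution:
+
+```python
+def num_patch_tokens(img_size: int, patch_size: int = 14) -> int:
+    """Number of patch tokens for a square input, excluding extra tokens."""
+    assert img_size % patch_size == 0
+    return (img_size // patch_size) ** 2
+
+print(num_patch_tokens(224))  # 256  (MIM pre-training)
+print(num_patch_tokens(336))  # 576  (tiny/small fine-tuned models)
+print(num_patch_tokens(448))  # 1024 (base/large fine-tuned models)
+```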
+
+### Image Classification on ImageNet-1k
+
+#### (*w/o* IN-21K intermediate fine-tuning)
+
+| Model | Pretrain | Params (M) | Flops (G) | Top-1 (%) | Top-5 (%) | Config | Download |
+| :---------------------------------------------------- | :----------------: | :--------: | :-------: | :-------: | :-------: | :---------------------------------: | :-------------------------------------------------------: |
+| `vit-tiny-p14_eva02-in21k-pre_3rdparty_in1k-336px`\* | EVA02 ImageNet-21k | 5.76 | 4.68 | 80.69 | 95.54 | [config](./eva02-tiny-p14_in1k.py) | [model](https://download.openmmlab.com/mmpretrain/v1.0/eva02/eva02-tiny-p14_in21k-pre_3rdparty_in1k-336px_20230505-a4e8708a.pth) |
+| `vit-small-p14_eva02-in21k-pre_3rdparty_in1k-336px`\* | EVA02 ImageNet-21k | 22.13 | 15.48 | 85.78 | 97.60 | [config](./eva02-small-p14_in1k.py) | [model](https://download.openmmlab.com/mmpretrain/v1.0/eva02/eva02-small-p14_in21k-pre_3rdparty_in1k-336px_20230505-9c5b0e85.pth) |
+| `vit-base-p14_eva02-in21k-pre_3rdparty_in1k-448px`\* | EVA02 ImageNet-21k | 87.13 | 107.11 | 88.29 | 98.53 | [config](./eva02-base-p14_in1k.py) | [model](https://download.openmmlab.com/mmpretrain/v1.0/eva02/eva02-base-p14_in21k-pre_3rdparty_in1k-448px_20230505-8ad211c5.pth) |
+
+*Models with * are converted from the [official repo](https://github.com/baaivision/EVA/tree/master/EVA-02). The config files of these models are only for inference. We haven't reproduced the training results.*
+
+#### (*w* IN-21K intermediate fine-tuning)
+
+| Model | Pretrain | Params (M) | Flops (G) | Top-1 (%) | Top-5 (%) | Config | Download |
+| :---------------------------------------------------- | :----------------: | :--------: | :-------: | :-------: | :-------: | :---------------------------------: | :-------------------------------------------------------: |
+| `vit-base-p14_eva02-in21k-pre_in21k-medft_3rdparty_in1k-448px`\* | EVA02 ImageNet-21k | 87.13 | 107.11 | 88.47 | 98.62 | [config](./eva02-base-p14_in1k.py) | [model](https://download.openmmlab.com/mmpretrain/v1.0/eva02/eva02-base-p14_in21k-pre_in21k-medft_3rdparty_in1k-448px_20230505-5cd4d87f.pth) |
+| `vit-large-p14_eva02-in21k-pre_in21k-medft_3rdparty_in1k-448px`\* | EVA02 ImageNet-21k | 305.08 | 362.33 | 89.65 | 98.95 | [config](./eva02-large-p14_in1k.py) | [model](https://download.openmmlab.com/mmpretrain/v1.0/eva02/eva02-large-p14_in21k-pre_in21k-medft_3rdparty_in1k-448px_20230505-926d1599.pth) |
+| `vit-large-p14_eva02_m38m-pre_in21k-medft_3rdparty_in1k-448px`\* | EVA02 Merged-38M | 305.10 | 362.33 | 89.83 | 99.00 | [config](./eva02-large-p14_in1k.py) | [model](https://download.openmmlab.com/mmpretrain/v1.0/eva02/eva02-large-p14_m38m-pre_in21k-medft_3rdparty_in1k-448px_20230505-150dc5ed.pth) |
+
+*Models with * are converted from the [official repo](https://github.com/baaivision/EVA/tree/master/EVA-02). The config files of these models are only for inference. We haven't reproduced the training results.*
+
+## Citation
+
+```bibtex
+@article{EVA-02,
+ title={EVA-02: A Visual Representation for Neon Genesis},
+ author={Yuxin Fang and Quan Sun and Xinggang Wang and Tiejun Huang and Xinlong Wang and Yue Cao},
+ journal={arXiv preprint arXiv:2303.11331},
+ year={2023}
+}
+```
diff --git a/configs/eva02/eva02-base-p14_headless.py b/configs/eva02/eva02-base-p14_headless.py
new file mode 100644
index 0000000000000000000000000000000000000000..27aa8f8a502810d39865ee85fd45b5152c8d5269
--- /dev/null
+++ b/configs/eva02/eva02-base-p14_headless.py
@@ -0,0 +1,21 @@
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='ViTEVA02',
+ arch='b',
+ img_size=224,
+ patch_size=14,
+ sub_ln=True,
+ final_norm=False,
+ out_type='avg_featmap'),
+ neck=None,
+ head=None,
+)
+
+data_preprocessor = dict(
+ # RGB format normalization parameters
+ mean=[0.48145466 * 255, 0.4578275 * 255, 0.40821073 * 255],
+ std=[0.26862954 * 255, 0.26130258 * 255, 0.27577711 * 255],
+ # convert image from BGR to RGB
+ to_rgb=True,
+)
diff --git a/configs/eva02/eva02-base-p14_in1k.py b/configs/eva02/eva02-base-p14_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..c8400d38542d71ee5d3f9713e34236bdc0e7783a
--- /dev/null
+++ b/configs/eva02/eva02-base-p14_in1k.py
@@ -0,0 +1,32 @@
+_base_ = [
+ '../_base_/datasets/imagenet_bs16_eva_448.py',
+ '../_base_/schedules/imagenet_bs2048_AdamW.py',
+ '../_base_/default_runtime.py'
+]
+
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='ViTEVA02',
+ arch='b',
+ img_size=448,
+ patch_size=14,
+ sub_ln=True,
+ final_norm=False,
+ out_type='avg_featmap'),
+ neck=None,
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=768,
+ loss=dict(
+ type='LabelSmoothLoss', label_smooth_val=0.1, mode='original'),
+ ),
+ init_cfg=[
+ dict(type='TruncNormal', layer='Linear', std=.02),
+ dict(type='Constant', layer='LayerNorm', val=1., bias=0.),
+ ],
+ train_cfg=dict(augments=[
+ dict(type='Mixup', alpha=0.8),
+ dict(type='CutMix', alpha=1.0)
+ ]))
diff --git a/configs/eva02/eva02-large-p14_headless.py b/configs/eva02/eva02-large-p14_headless.py
new file mode 100644
index 0000000000000000000000000000000000000000..e101ac977c8590572190350292325c78477dbfd3
--- /dev/null
+++ b/configs/eva02/eva02-large-p14_headless.py
@@ -0,0 +1,21 @@
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='ViTEVA02',
+ arch='l',
+ img_size=224,
+ patch_size=14,
+ sub_ln=True,
+ final_norm=False,
+ out_type='avg_featmap'),
+ neck=None,
+ head=None,
+)
+
+data_preprocessor = dict(
+ # RGB format normalization parameters
+ mean=[0.48145466 * 255, 0.4578275 * 255, 0.40821073 * 255],
+ std=[0.26862954 * 255, 0.26130258 * 255, 0.27577711 * 255],
+ # convert image from BGR to RGB
+ to_rgb=True,
+)
diff --git a/configs/eva02/eva02-large-p14_in1k.py b/configs/eva02/eva02-large-p14_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..91a42776dafd0f78ba6f3c1fbe68bfc602ad502e
--- /dev/null
+++ b/configs/eva02/eva02-large-p14_in1k.py
@@ -0,0 +1,32 @@
+_base_ = [
+ '../_base_/datasets/imagenet_bs16_eva_448.py',
+ '../_base_/schedules/imagenet_bs2048_AdamW.py',
+ '../_base_/default_runtime.py'
+]
+
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='ViTEVA02',
+ arch='l',
+ img_size=448,
+ patch_size=14,
+ sub_ln=True,
+ final_norm=False,
+ out_type='avg_featmap'),
+ neck=None,
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=1024,
+ loss=dict(
+ type='LabelSmoothLoss', label_smooth_val=0.1, mode='original'),
+ ),
+ init_cfg=[
+ dict(type='TruncNormal', layer='Linear', std=.02),
+ dict(type='Constant', layer='LayerNorm', val=1., bias=0.),
+ ],
+ train_cfg=dict(augments=[
+ dict(type='Mixup', alpha=0.8),
+ dict(type='CutMix', alpha=1.0)
+ ]))
diff --git a/configs/eva02/eva02-small-p14_headless.py b/configs/eva02/eva02-small-p14_headless.py
new file mode 100644
index 0000000000000000000000000000000000000000..a969819308e9cea449b06ae3533839d72a2b96fe
--- /dev/null
+++ b/configs/eva02/eva02-small-p14_headless.py
@@ -0,0 +1,20 @@
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='ViTEVA02',
+ arch='s',
+ img_size=224,
+ patch_size=14,
+ final_norm=False,
+ out_type='avg_featmap'),
+ neck=None,
+ head=None,
+)
+
+data_preprocessor = dict(
+ # RGB format normalization parameters
+ mean=[0.48145466 * 255, 0.4578275 * 255, 0.40821073 * 255],
+ std=[0.26862954 * 255, 0.26130258 * 255, 0.27577711 * 255],
+ # convert image from BGR to RGB
+ to_rgb=True,
+)
diff --git a/configs/eva02/eva02-small-p14_in1k.py b/configs/eva02/eva02-small-p14_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..4a16d92456e39bb1147423682333cd24673133e6
--- /dev/null
+++ b/configs/eva02/eva02-small-p14_in1k.py
@@ -0,0 +1,31 @@
+_base_ = [
+ '../_base_/datasets/imagenet_bs16_eva_336.py',
+ '../_base_/schedules/imagenet_bs2048_AdamW.py',
+ '../_base_/default_runtime.py'
+]
+
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='ViTEVA02',
+ arch='s',
+ img_size=336,
+ patch_size=14,
+ final_norm=False,
+ out_type='avg_featmap'),
+ neck=None,
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=384,
+ loss=dict(
+ type='LabelSmoothLoss', label_smooth_val=0.1, mode='original'),
+ ),
+ init_cfg=[
+ dict(type='TruncNormal', layer='Linear', std=.02),
+ dict(type='Constant', layer='LayerNorm', val=1., bias=0.),
+ ],
+ train_cfg=dict(augments=[
+ dict(type='Mixup', alpha=0.8),
+ dict(type='CutMix', alpha=1.0)
+ ]))
diff --git a/configs/eva02/eva02-tiny-p14_headless.py b/configs/eva02/eva02-tiny-p14_headless.py
new file mode 100644
index 0000000000000000000000000000000000000000..783d0ea2ebf35df3af8072958322f4f572e36210
--- /dev/null
+++ b/configs/eva02/eva02-tiny-p14_headless.py
@@ -0,0 +1,20 @@
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='ViTEVA02',
+ arch='t',
+ img_size=224,
+ patch_size=14,
+ final_norm=False,
+ out_type='avg_featmap'),
+ neck=None,
+ head=None,
+)
+
+data_preprocessor = dict(
+ # RGB format normalization parameters
+ mean=[0.48145466 * 255, 0.4578275 * 255, 0.40821073 * 255],
+ std=[0.26862954 * 255, 0.26130258 * 255, 0.27577711 * 255],
+ # convert image from BGR to RGB
+ to_rgb=True,
+)
diff --git a/configs/eva02/eva02-tiny-p14_in1k.py b/configs/eva02/eva02-tiny-p14_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..84e68d7edd92d91689aa501397a9dbe3eba0b8b3
--- /dev/null
+++ b/configs/eva02/eva02-tiny-p14_in1k.py
@@ -0,0 +1,31 @@
+_base_ = [
+ '../_base_/datasets/imagenet_bs16_eva_336.py',
+ '../_base_/schedules/imagenet_bs2048_AdamW.py',
+ '../_base_/default_runtime.py'
+]
+
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='ViTEVA02',
+ arch='t',
+ img_size=336,
+ patch_size=14,
+ final_norm=False,
+ out_type='avg_featmap'),
+ neck=None,
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=192,
+ loss=dict(
+ type='LabelSmoothLoss', label_smooth_val=0.1, mode='original'),
+ ),
+ init_cfg=[
+ dict(type='TruncNormal', layer='Linear', std=.02),
+ dict(type='Constant', layer='LayerNorm', val=1., bias=0.),
+ ],
+ train_cfg=dict(augments=[
+ dict(type='Mixup', alpha=0.8),
+ dict(type='CutMix', alpha=1.0)
+ ]))
diff --git a/configs/eva02/metafile.yml b/configs/eva02/metafile.yml
new file mode 100644
index 0000000000000000000000000000000000000000..80acf904fb46e95f0ae52b1ff6fe3cf620cc8ae7
--- /dev/null
+++ b/configs/eva02/metafile.yml
@@ -0,0 +1,199 @@
+Collections:
+ - Name: EVA02
+ Metadata:
+ Architecture:
+ - Rotary Position Embedding
+ - Sub Layer Normalization
+ - SwiGLU
+ Paper:
+ Title: 'EVA-02: A Visual Representation for Neon Genesis'
+ URL: https://arxiv.org/abs/2303.11331
+ README: configs/eva02/README.md
+
+Models:
+ - Name: vit-tiny-p14_eva02-pre_in21k
+ Metadata:
+ FLOPs: 1703439360
+ Parameters: 5504064
+ Training Data:
+ - ImageNet-21k
+ In Collection: EVA02
+ Weights: https://download.openmmlab.com/mmpretrain/v1.0/eva02/eva02-tiny-p14_pre_in21k_20230505-d703e7b1.pth
+ Config: configs/eva02/eva02-tiny-p14_headless.py
+ Converted From:
+ Weights: https://huggingface.co/Yuxin-CV/EVA-02/blob/main/eva02/pt/eva02_Ti_pt_in21k_p14.pt
+ Code: https://github.com/baaivision/EVA/tree/master/EVA-02
+ Downstream:
+ - vit-tiny-p14_eva02-in21k-pre_3rdparty_in1k-336px
+ - Name: vit-tiny-p14_eva02-in21k-pre_3rdparty_in1k-336px
+ Metadata:
+ FLOPs: 4675416000
+ Parameters: 5758888
+ Training Data:
+ - ImageNet-21k
+ - ImageNet-1k
+ In Collection: EVA02
+ Results:
+ - Dataset: ImageNet-1k
+ Task: Image Classification
+ Metrics:
+ Top 1 Accuracy: 80.69
+ Top 5 Accuracy: 95.54
+ Weights: https://download.openmmlab.com/mmpretrain/v1.0/eva02/eva02-tiny-p14_in21k-pre_3rdparty_in1k-336px_20230505-a4e8708a.pth
+ Config: configs/eva02/eva02-tiny-p14_in1k.py
+ Converted From:
+ Weights: https://huggingface.co/Yuxin-CV/EVA-02/blob/main/eva02/cls/in1k/eva02_Ti_pt_in21k_ft_in1k_p14.pt
+ Code: https://github.com/baaivision/EVA/tree/master/EVA-02
+ - Name: vit-small-p14_eva02-pre_in21k
+ Metadata:
+ FLOPs: 6135404544
+ Parameters: 21624960
+ Training Data:
+ - ImageNet-21k
+ In Collection: EVA02
+ Weights: https://download.openmmlab.com/mmpretrain/v1.0/eva02/eva02-small-p14_pre_in21k_20230505-3175f463.pth
+ Config: configs/eva02/eva02-small-p14_headless.py
+ Converted From:
+ Weights: https://huggingface.co/Yuxin-CV/EVA-02/blob/main/eva02/pt/eva02_S_pt_in21k_p14.pt
+ Code: https://github.com/baaivision/EVA/tree/master/EVA-02
+ Downstream:
+ - vit-small-p14_eva02-in21k-pre_3rdparty_in1k-336px
+ - Name: vit-small-p14_eva02-in21k-pre_3rdparty_in1k-336px
+ Metadata:
+ FLOPs: 15476744064
+ Parameters: 22133608
+ Training Data:
+ - ImageNet-21k
+ - ImageNet-1k
+ In Collection: EVA02
+ Results:
+ - Dataset: ImageNet-1k
+ Task: Image Classification
+ Metrics:
+ Top 1 Accuracy: 85.78
+ Top 5 Accuracy: 97.60
+ Weights: https://download.openmmlab.com/mmpretrain/v1.0/eva02/eva02-small-p14_in21k-pre_3rdparty_in1k-336px_20230505-9c5b0e85.pth
+ Config: configs/eva02/eva02-small-p14_in1k.py
+ Converted From:
+ Weights: https://huggingface.co/Yuxin-CV/EVA-02/blob/main/eva02/cls/in1k/eva02_S_pt_in21k_ft_in1k_p14.pt
+ Code: https://github.com/baaivision/EVA/tree/master/EVA-02
+ - Name: vit-base-p14_eva02-pre_in21k
+ Metadata:
+ FLOPs: 23216492544
+ Parameters: 85766400
+ Training Data:
+ - ImageNet-21k
+ In Collection: EVA02
+ Weights: https://download.openmmlab.com/mmpretrain/v1.0/eva02/eva02-base-p14_pre_in21k_20230505-2f2d4d3c.pth
+ Config: configs/eva02/eva02-base-p14_headless.py
+ Converted From:
+ Weights: https://huggingface.co/Yuxin-CV/EVA-02/blob/main/eva02/pt/eva02_B_pt_in21k_p14.pt
+ Code: https://github.com/baaivision/EVA/tree/master/EVA-02
+ Downstream:
+ - vit-base-p14_eva02-in21k-pre_3rdparty_in1k-448px
+ - vit-base-p14_eva02-in21k-pre_in21k-medft_3rdparty_in1k-448px
+ - Name: vit-base-p14_eva02-in21k-pre_3rdparty_in1k-448px
+ Metadata:
+ FLOPs: 107105984256
+ Parameters: 87126760
+ Training Data:
+ - ImageNet-21k
+ - ImageNet-1k
+ In Collection: EVA02
+ Results:
+ - Dataset: ImageNet-1k
+ Task: Image Classification
+ Metrics:
+ Top 1 Accuracy: 88.29
+ Top 5 Accuracy: 98.53
+ Weights: https://download.openmmlab.com/mmpretrain/v1.0/eva02/eva02-base-p14_in21k-pre_3rdparty_in1k-448px_20230505-8ad211c5.pth
+ Config: configs/eva02/eva02-base-p14_in1k.py
+ Converted From:
+ Weights: https://huggingface.co/Yuxin-CV/EVA-02/blob/main/eva02/cls/in1k/eva02_B_pt_in21k_ft_in1k_p14.pt
+ Code: https://github.com/baaivision/EVA/tree/master/EVA-02
+ - Name: vit-base-p14_eva02-in21k-pre_in21k-medft_3rdparty_in1k-448px
+ Metadata:
+ FLOPs: 107105984256
+ Parameters: 87126760
+ Training Data:
+ - ImageNet-21k
+ - ImageNet-1k
+ In Collection: EVA02
+ Results:
+ - Dataset: ImageNet-1k
+ Task: Image Classification
+ Metrics:
+ Top 1 Accuracy: 88.47
+ Top 5 Accuracy: 98.62
+ Weights: https://download.openmmlab.com/mmpretrain/v1.0/eva02/eva02-base-p14_in21k-pre_in21k-medft_3rdparty_in1k-448px_20230505-5cd4d87f.pth
+ Config: configs/eva02/eva02-base-p14_in1k.py
+ Converted From:
+ Weights: https://huggingface.co/Yuxin-CV/EVA-02/blob/main/eva02/cls/in21k/eva02_B_pt_in21k_medft_in21k_p14.pt
+ Code: https://github.com/baaivision/EVA/tree/master/EVA-02
+ - Name: vit-large-p14_eva02-pre_in21k
+ Metadata:
+ FLOPs: 81146703792
+ Parameters: 303291328
+ Training Data:
+ - ImageNet-21k
+ In Collection: EVA02
+ Weights: https://download.openmmlab.com/mmpretrain/v1.0/eva02/eva02-large-p14_pre_in21k_20230505-9072de5d.pth
+ Config: configs/eva02/eva02-large-p14_headless.py
+ Converted From:
+ Weights: https://huggingface.co/Yuxin-CV/EVA-02/blob/main/eva02/pt/eva02_L_pt_in21k_p14.pt
+ Code: https://github.com/baaivision/EVA/tree/master/EVA-02
+ Downstream:
+ - vit-large-p14_eva02-in21k-pre_in21k-medft_3rdparty_in1k-448px
+ - Name: vit-large-p14_eva02-in21k-pre_in21k-medft_3rdparty_in1k-448px
+ Metadata:
+ FLOPs: 362333836208
+ Parameters: 305104808
+ Training Data:
+ - ImageNet-21k
+ - ImageNet-1k
+ In Collection: EVA02
+ Results:
+ - Dataset: ImageNet-1k
+ Task: Image Classification
+ Metrics:
+ Top 1 Accuracy: 89.65
+ Top 5 Accuracy: 98.95
+ Weights: https://download.openmmlab.com/mmpretrain/v1.0/eva02/eva02-large-p14_in21k-pre_in21k-medft_3rdparty_in1k-448px_20230505-926d1599.pth
+ Config: configs/eva02/eva02-large-p14_in1k.py
+ Converted From:
+ Weights: https://huggingface.co/Yuxin-CV/EVA-02/blob/main/eva02/cls/in21k/eva02_L_pt_in21k_medft_in21k_p14.pt
+ Code: https://github.com/baaivision/EVA/tree/master/EVA-02
+ - Name: vit-large-p14_eva02-pre_m38m
+ Metadata:
+ FLOPs: 81146703792
+ Parameters: 303291328
+ Training Data:
+ - Merged-38M
+ In Collection: EVA02
+ Weights: https://download.openmmlab.com/mmpretrain/v1.0/eva02/eva02-large-p14_pre_m38m_20230505-b8a1a261.pth
+ Config: configs/eva02/eva02-large-p14_headless.py
+ Converted From:
+ Weights: https://huggingface.co/Yuxin-CV/EVA-02/blob/main/eva02/pt/eva02_L_pt_m38m_p14.pt
+ Code: https://github.com/baaivision/EVA/tree/master/EVA-02
+ Downstream:
+ - vit-large-p14_eva02_m38m-pre_in21k-medft_3rdparty_in1k-448px
+ - Name: vit-large-p14_eva02_m38m-pre_in21k-medft_3rdparty_in1k-448px
+ Metadata:
+ FLOPs: 362333836208
+ Parameters: 305104808
+ Training Data:
+ - Merged-38M
+ - ImageNet-21k
+ - ImageNet-1k
+ In Collection: EVA02
+ Results:
+ - Dataset: ImageNet-1k
+ Task: Image Classification
+ Metrics:
+ Top 1 Accuracy: 89.83
+ Top 5 Accuracy: 99.00
+ Weights: https://download.openmmlab.com/mmpretrain/v1.0/eva02/eva02-large-p14_m38m-pre_in21k-medft_3rdparty_in1k-448px_20230505-150dc5ed.pth
+ Config: configs/eva02/eva02-large-p14_in1k.py
+ Converted From:
+ Weights: https://huggingface.co/Yuxin-CV/EVA-02/blob/main/eva02/cls/in21k/eva02_L_pt_m38m_medft_in21k_p14.pt
+ Code: https://github.com/baaivision/EVA/tree/master/EVA-02
diff --git a/configs/flamingo/README.md b/configs/flamingo/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..60c6af0f50e43cb0f84d2a3dbd2d343a435c6310
--- /dev/null
+++ b/configs/flamingo/README.md
@@ -0,0 +1,82 @@
+# Flamingo
+
+> [Flamingo: a Visual Language Model for Few-Shot Learning](https://arxiv.org/abs/2204.14198)
+
+
+
+## Abstract
+
+Building models that can be rapidly adapted to novel tasks using only a handful of annotated examples is an open challenge for multimodal machine learning research. We introduce Flamingo, a family of Visual Language Models (VLM) with this ability. We propose key architectural innovations to: (i) bridge powerful pretrained vision-only and language-only models, (ii) handle sequences of arbitrarily interleaved visual and textual data, and (iii) seamlessly ingest images or videos as inputs. Thanks to their flexibility, Flamingo models can be trained on large-scale multimodal web corpora containing arbitrarily interleaved text and images, which is key to endow them with in-context few-shot learning capabilities. We perform a thorough evaluation of our models, exploring and measuring their ability to rapidly adapt to a variety of image and video tasks. These include open-ended tasks such as visual question-answering, where the model is prompted with a question which it has to answer; captioning tasks, which evaluate the ability to describe a scene or an event; and close-ended tasks such as multiple-choice visual question-answering. For tasks lying anywhere on this spectrum, a single Flamingo model can achieve a new state of the art with few-shot learning, simply by prompting the model with task-specific examples. On numerous benchmarks, Flamingo outperforms models fine-tuned on thousands of times more task-specific data.
+
+
+

+
+
+## How to use it?
+
+
+
+**Use the model**
+
+```python
+from mmpretrain import inference_model
+
+result = inference_model('flamingo_3rdparty-zeroshot_caption', 'demo/cat-dog.png')
+print(result)
+# {'pred_caption': 'A dog and a cat are looking at each other. '}
+```
+
+**Test Command**
+
+Prepare your dataset according to the [docs](https://mmpretrain.readthedocs.io/en/latest/user_guides/dataset_prepare.html#prepare-dataset).
+
+Test:
+
+```shell
+python tools/test.py configs/flamingo/flamingo_zeroshot_caption.py https://download.openmmlab.com/mmclassification/v1/flamingo/openflamingo-9b-adapter_20230505-554310c8.pth
+```
+
+
+
+## Models and results
+
+### Image Caption on COCO
+
+| Model | Params (G) | CIDER | Config | Download |
+| :------------------------------------- | :--------: | :---: | :------------------------------------: | :-----------------------------------------------------------------------------------------------------------: |
+| `flamingo_3rdparty-zeroshot_caption`\* | 8.220 | 65.50 | [config](flamingo_zeroshot_caption.py) | [model](https://download.openmmlab.com/mmclassification/v1/flamingo/openflamingo-9b-adapter_20230505-554310c8.pth) |
+
+*Models with * are converted from the [OpenFlamingo repository](https://github.com/mlfoundations/open_flamingo). The config files of these models are only for inference. We haven't reproduced the training results.*
+
+### Visual Question Answering on VQAv2
+
+| Model | Params (G) | Accuracy | Config | Download |
+| :--------------------------------- | :--------: | :------: | :--------------------------------: | :----------------------------------------------------------------------------------------------------------------: |
+| `flamingo_3rdparty-zeroshot_vqa`\* | 8.22 | 43.50 | [config](flamingo_zeroshot_vqa.py) | [model](https://download.openmmlab.com/mmclassification/v1/flamingo/openflamingo-9b-adapter_20230505-554310c8.pth) |
+
+*Models with * are converted from the [OpenFlamingo repository](https://github.com/mlfoundations/open_flamingo). The config files of these models are only for inference. We haven't reproduced the training results.*
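+
+The zero-shot VQA model above can be called through the same `inference_model` helper as the caption example, passing a question together with the image. The snippet below is only a minimal sketch: the image path and question are placeholders, and the printed result is not a verified output.
+
+```python
+from mmpretrain import inference_model
+
+# Zero-shot VQA with the converted OpenFlamingo weights.
+# The image and question here are illustrative placeholders.
+result = inference_model(
+    'flamingo_3rdparty-zeroshot_vqa',
+    'demo/cat-dog.png',
+    'What animals are in the picture?')
+print(result)  # expected to be a dict containing a 'pred_answer' entry
+```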
+
+## Citation
+
+```bibtex
+@article{Alayrac2022FlamingoAV,
+ title={Flamingo: a Visual Language Model for Few-Shot Learning},
+ author={Jean-Baptiste Alayrac and Jeff Donahue and Pauline Luc and Antoine Miech and Iain Barr and Yana Hasson and Karel Lenc and Arthur Mensch and Katie Millican and Malcolm Reynolds and Roman Ring and Eliza Rutherford and Serkan Cabi and Tengda Han and Zhitao Gong and Sina Samangooei and Marianne Monteiro and Jacob Menick and Sebastian Borgeaud and Andy Brock and Aida Nematzadeh and Sahand Sharifzadeh and Mikolaj Binkowski and Ricardo Barreira and Oriol Vinyals and Andrew Zisserman and Karen Simonyan},
+ journal={ArXiv},
+ year={2022},
+ volume={abs/2204.14198}
+}
+```
+
+```bibtex
+@software{anas_awadalla_2023_7733589,
+ author = {Awadalla, Anas and Gao, Irena and Gardner, Joshua and Hessel, Jack and Hanafy, Yusuf and Zhu, Wanrong and Marathe, Kalyani and Bitton, Yonatan and Gadre, Samir and Jitsev, Jenia and Kornblith, Simon and Koh, Pang Wei and Ilharco, Gabriel and Wortsman, Mitchell and Schmidt, Ludwig},
+ title = {OpenFlamingo},
+ month = mar,
+ year = 2023,
+ publisher = {Zenodo},
+ version = {v0.1.1},
+ doi = {10.5281/zenodo.7733589},
+ url = {https://doi.org/10.5281/zenodo.7733589}
+}
+```
diff --git a/configs/flamingo/flamingo_fewshot_caption.py b/configs/flamingo/flamingo_fewshot_caption.py
new file mode 100644
index 0000000000000000000000000000000000000000..d6f9c2bfccdfb9617a14fae454af9bf209f3199a
--- /dev/null
+++ b/configs/flamingo/flamingo_fewshot_caption.py
@@ -0,0 +1,95 @@
+_base_ = [
+ '../_base_/default_runtime.py',
+]
+
+# model settings
+model = dict(
+ type='Flamingo',
+ tokenizer=dict(
+ type='LlamaTokenizer', name_or_path='decapoda-research/llama-7b-hf'),
+ vision_encoder=dict(
+ type='VisionTransformer',
+ arch='l',
+ patch_size=14,
+ pre_norm=True,
+ norm_cfg=dict(type='LN', eps=1e-5),
+ layer_cfgs=dict(act_cfg=dict(type='QuickGELU')),
+ final_norm=False,
+ out_type='raw',
+ pretrained=(
+ 'https://download.openmmlab.com/mmclassification/v0/clip/'
+ 'vit-large-p14_clip-openai-pre_3rdparty_20230517-95e2af0b.pth'),
+ ),
+ lang_encoder=dict(
+ base=dict(
+ type='AutoModelForCausalLM',
+ name_or_path='decapoda-research/llama-7b-hf',
+ local_files_only=True),
+ adapter=dict(
+ type='FlamingoLMAdapter',
+ vis_hidden_size=1024,
+ cross_attn_every_n_layers=4,
+ use_media_placement_augmentation=False),
+ ),
+ task='caption',
+ shot_prompt_tmpl='Output:{caption}<|endofchunk|>',
+ final_prompt_tmpl='Output:',
+ generation_cfg=dict(num_beams=3, max_new_tokens=20, length_penalty=-2.0))
+
+# data settings
+data_preprocessor = dict(
+ mean=[122.770938, 116.7460125, 104.09373615],
+ std=[68.5005327, 66.6321579, 70.32316305],
+ to_rgb=True,
+)
+
+test_pipeline = [
+ dict(
+ type='ApplyToList',
+ # Flamingo requires loading multiple images during few-shot inference.
+ scatter_key='img_path',
+ transforms=[
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='ResizeEdge',
+ scale=224,
+ interpolation='bicubic',
+ backend='pillow'),
+ dict(type='CenterCrop', crop_size=(224, 224)),
+ ],
+ collate_keys=['img', 'scale_factor', 'ori_shape'],
+ ),
+ dict(
+ type='PackInputs',
+ algorithm_keys=['gt_caption', 'shots'],
+ meta_keys=['image_id']),
+]
+
+val_dataloader = dict(
+ batch_size=8,
+ num_workers=8,
+ dataset=dict(
+ type='FlamingoEvalCOCOCaption',
+ data_root='data/coco',
+ ann_file='annotations/captions_train2014.json',
+ data_prefix=dict(img_path='train2014'),
+ pipeline=test_pipeline,
+ num_shots=2,
+ num_support_examples=2048,
+ num_query_examples=5000,
+ ),
+ sampler=dict(type='DefaultSampler', shuffle=False),
+ persistent_workers=True,
+)
+
+val_evaluator = dict(
+ type='COCOCaption',
+ ann_file='data/coco/annotations/captions_train2014.json')
+
+# If you want standard test, please manually configure the test dataset
+test_dataloader = val_dataloader
+test_evaluator = val_evaluator
+
+# schedule settings
+val_cfg = dict()
+test_cfg = dict()
diff --git a/configs/flamingo/flamingo_fewshot_vqa.py b/configs/flamingo/flamingo_fewshot_vqa.py
new file mode 100644
index 0000000000000000000000000000000000000000..b85a6989b75b4cd1d7bf585cb83b40add12f104f
--- /dev/null
+++ b/configs/flamingo/flamingo_fewshot_vqa.py
@@ -0,0 +1,109 @@
+_base_ = [
+ '../_base_/default_runtime.py',
+]
+
+# model settings
+model = dict(
+ type='Flamingo',
+ tokenizer=dict(
+ type='LlamaTokenizer', name_or_path='decapoda-research/llama-7b-hf'),
+ vision_encoder=dict(
+ type='VisionTransformer',
+ arch='l',
+ patch_size=14,
+ pre_norm=True,
+ norm_cfg=dict(type='LN', eps=1e-5),
+ layer_cfgs=dict(act_cfg=dict(type='QuickGELU')),
+ final_norm=False,
+ out_type='raw',
+ pretrained=(
+ 'https://download.openmmlab.com/mmclassification/v0/clip/'
+ 'vit-large-p14_clip-openai-pre_3rdparty_20230517-95e2af0b.pth'),
+ ),
+ lang_encoder=dict(
+ base=dict(
+ type='AutoModelForCausalLM',
+ name_or_path='decapoda-research/llama-7b-hf',
+ local_files_only=True),
+ adapter=dict(
+ type='FlamingoLMAdapter',
+ vis_hidden_size=1024,
+ cross_attn_every_n_layers=4,
+ use_media_placement_augmentation=False),
+ ),
+ task='vqa',
+ shot_prompt_tmpl=
+ 'Question:{question} Short Answer:{answer}<|endofchunk|>',
+ final_prompt_tmpl='Question:{question} Short Answer:',
+ generation_cfg=dict(num_beams=3, max_new_tokens=5, length_penalty=-2.0))
+
+# data settings
+data_preprocessor = dict(
+ mean=[122.770938, 116.7460125, 104.09373615],
+ std=[68.5005327, 66.6321579, 70.32316305],
+ to_rgb=True,
+)
+
+test_pipeline = [
+ dict(
+ type='ApplyToList',
+ # Flamingo requires loading multiple images during few-shot inference.
+ scatter_key='img_path',
+ transforms=[
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='ResizeEdge',
+ scale=224,
+ interpolation='bicubic',
+ backend='pillow'),
+ dict(type='CenterCrop', crop_size=(224, 224)),
+ ],
+ collate_keys=['img', 'scale_factor', 'ori_shape'],
+ ),
+ dict(
+ type='PackInputs',
+ algorithm_keys=['question', 'gt_answer', 'gt_answer_weight', 'shots'],
+ meta_keys=['image_id']),
+]
+
+val_dataloader = dict(
+ batch_size=8,
+ num_workers=8,
+ dataset=dict(
+ type='FlamingoEvalCOCOVQA',
+ data_root='data/coco',
+ data_prefix='val2014',
+ question_file='annotations/v2_OpenEnded_mscoco_val2014_questions.json',
+ ann_file='annotations/v2_mscoco_val2014_annotations.json',
+ pipeline=test_pipeline,
+ num_shots=2,
+ num_support_examples=2048,
+ num_query_examples=5000,
+ ),
+ sampler=dict(type='DefaultSampler', shuffle=False),
+ persistent_workers=True,
+)
+val_evaluator = dict(type='VQAAcc')
+
+test_dataloader = dict(
+ batch_size=8,
+ num_workers=8,
+ dataset=dict(
+ type='FlamingoEvalCOCOVQA',
+ data_root='data/coco',
+ data_prefix='test2015',
+ question_file=
+ 'annotations/v2_OpenEnded_mscoco_test-dev2015_questions.json',
+ pipeline=test_pipeline,
+ num_shots=0,
+ num_support_examples=2048,
+ num_query_examples=5000,
+ ),
+ sampler=dict(type='DefaultSampler', shuffle=False),
+ persistent_workers=True,
+)
+test_evaluator = dict(type='ReportVQA', file_path='vqa_test-dev.json')
+
+# schedule settings
+val_cfg = dict()
+test_cfg = dict()
diff --git a/configs/flamingo/flamingo_zeroshot_caption.py b/configs/flamingo/flamingo_zeroshot_caption.py
new file mode 100644
index 0000000000000000000000000000000000000000..deb786e4d56e70abd26723462068dfb9ad4ed9aa
--- /dev/null
+++ b/configs/flamingo/flamingo_zeroshot_caption.py
@@ -0,0 +1,95 @@
+_base_ = [
+ '../_base_/default_runtime.py',
+]
+
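+# Text-only in-context examples prepended to the query at inference time; no
+# support images are used in the zero-shot setting (`num_shots=0` below).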
+zeroshot_prompt = (
+ 'Output:A child holding a flowered umbrella and petting a yak.<|endofchunk|>' # noqa: E501
+ 'Output:The child is holding a brush close to his mouth.<|endofchunk|>' # noqa: E501
+)
+
+# model settings
+model = dict(
+ type='Flamingo',
+ tokenizer=dict(
+ type='LlamaTokenizer', name_or_path='decapoda-research/llama-7b-hf'),
+ vision_encoder=dict(
+ type='VisionTransformer',
+ arch='l',
+ patch_size=14,
+ pre_norm=True,
+ norm_cfg=dict(type='LN', eps=1e-5),
+ layer_cfgs=dict(act_cfg=dict(type='QuickGELU')),
+ final_norm=False,
+ out_type='raw',
+ pretrained=(
+ 'https://download.openmmlab.com/mmclassification/v0/clip/'
+ 'vit-large-p14_clip-openai-pre_3rdparty_20230517-95e2af0b.pth'),
+ ),
+ lang_encoder=dict(
+ base=dict(
+ type='AutoModelForCausalLM',
+ name_or_path='decapoda-research/llama-7b-hf',
+ local_files_only=True),
+ adapter=dict(
+ type='FlamingoLMAdapter',
+ vis_hidden_size=1024,
+ cross_attn_every_n_layers=4,
+ use_media_placement_augmentation=False),
+ ),
+ task='caption',
+ zeroshot_prompt=zeroshot_prompt,
+ final_prompt_tmpl='Output:',
+ generation_cfg=dict(num_beams=3, max_new_tokens=20, length_penalty=-2.0),
+)
+
+# data settings
+data_preprocessor = dict(
+ type='MultiModalDataPreprocessor',
+ mean=[122.770938, 116.7460125, 104.09373615],
+ std=[68.5005327, 66.6321579, 70.32316305],
+ to_rgb=True,
+)
+
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='ResizeEdge',
+ scale=224,
+ interpolation='bicubic',
+ backend='pillow'),
+ dict(type='CenterCrop', crop_size=(224, 224)),
+ dict(
+ type='PackInputs',
+ algorithm_keys=['gt_caption'],
+ meta_keys=['image_id'],
+ ),
+]
+
+val_dataloader = dict(
+ batch_size=8,
+ num_workers=8,
+ dataset=dict(
+ type='FlamingoEvalCOCOCaption',
+ data_root='data/coco',
+ ann_file='annotations/captions_train2014.json',
+ data_prefix=dict(img_path='train2014'),
+ pipeline=test_pipeline,
+ num_shots=0,
+ num_support_examples=2048,
+ num_query_examples=5000,
+ ),
+ sampler=dict(type='DefaultSampler', shuffle=False),
+ persistent_workers=True,
+)
+
+val_evaluator = dict(
+ type='COCOCaption',
+ ann_file='data/coco/annotations/captions_train2014.json')
+
+# If you want standard test, please manually configure the test dataset
+test_dataloader = val_dataloader
+test_evaluator = val_evaluator
+
+# schedule settings
+val_cfg = dict()
+test_cfg = dict()
diff --git a/configs/flamingo/flamingo_zeroshot_vqa.py b/configs/flamingo/flamingo_zeroshot_vqa.py
new file mode 100644
index 0000000000000000000000000000000000000000..c43c7b8686679364490aa8acf893c61f4c5500f7
--- /dev/null
+++ b/configs/flamingo/flamingo_zeroshot_vqa.py
@@ -0,0 +1,107 @@
+_base_ = [
+ '../_base_/default_runtime.py',
+]
+
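+# Text-only question/answer examples prepended as the zero-shot prompt; no
+# support images are used (`num_shots=0` in the dataloaders below).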
+zeroshot_prompt = (
+ 'Question:What is this photo taken looking through? Short Answer:pitcher<|endofchunk|>' # noqa: E501
+ 'Question:How many people are wearing shorts in the forefront of this photo? Short Answer:4<|endofchunk|>' # noqa: E501
+)
+
+# model settings
+model = dict(
+ type='Flamingo',
+ tokenizer=dict(
+ type='LlamaTokenizer', name_or_path='decapoda-research/llama-7b-hf'),
+ vision_encoder=dict(
+ type='VisionTransformer',
+ arch='l',
+ patch_size=14,
+ pre_norm=True,
+ norm_cfg=dict(type='LN', eps=1e-5),
+ layer_cfgs=dict(act_cfg=dict(type='QuickGELU')),
+ final_norm=False,
+ out_type='raw',
+ pretrained=(
+ 'https://download.openmmlab.com/mmclassification/v0/clip/'
+ 'vit-large-p14_clip-openai-pre_3rdparty_20230517-95e2af0b.pth'),
+ ),
+ lang_encoder=dict(
+ base=dict(
+ type='AutoModelForCausalLM',
+ name_or_path='decapoda-research/llama-7b-hf',
+ local_files_only=True),
+ adapter=dict(
+ type='FlamingoLMAdapter',
+ vis_hidden_size=1024,
+ cross_attn_every_n_layers=4,
+ use_media_placement_augmentation=False),
+ ),
+ task='vqa',
+ zeroshot_prompt=zeroshot_prompt,
+ final_prompt_tmpl='Question:{question} Short Answer:',
+ generation_cfg=dict(num_beams=3, max_new_tokens=5, length_penalty=-2.0))
+
+# data settings
+data_preprocessor = dict(
+ type='MultiModalDataPreprocessor',
+ mean=[122.770938, 116.7460125, 104.09373615],
+ std=[68.5005327, 66.6321579, 70.32316305],
+ to_rgb=True,
+)
+
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='ResizeEdge',
+ scale=224,
+ interpolation='bicubic',
+ backend='pillow'),
+ dict(type='CenterCrop', crop_size=(224, 224)),
+ dict(
+ type='PackInputs',
+ algorithm_keys=['question', 'gt_answer', 'gt_answer_weight', 'shots'],
+ meta_keys=['image_id'],
+ ),
+]
+
+val_dataloader = dict(
+ batch_size=8,
+ num_workers=8,
+ dataset=dict(
+ type='FlamingoEvalCOCOVQA',
+ data_root='data/coco',
+ data_prefix='val2014',
+ question_file='annotations/v2_OpenEnded_mscoco_val2014_questions.json',
+ ann_file='annotations/v2_mscoco_val2014_annotations.json',
+ pipeline=test_pipeline,
+ num_shots=0,
+ num_support_examples=2048,
+ num_query_examples=5000,
+ ),
+ sampler=dict(type='DefaultSampler', shuffle=False),
+ persistent_workers=True,
+)
+val_evaluator = dict(type='VQAAcc')
+
+test_dataloader = dict(
+ batch_size=8,
+ num_workers=8,
+ dataset=dict(
+ type='FlamingoEvalCOCOVQA',
+ data_root='data/coco',
+ data_prefix='test2015',
+ question_file=
+ 'annotations/v2_OpenEnded_mscoco_test-dev2015_questions.json',
+ pipeline=test_pipeline,
+ num_shots=0,
+ num_support_examples=2048,
+ num_query_examples=5000,
+ ),
+ sampler=dict(type='DefaultSampler', shuffle=False),
+ persistent_workers=True,
+)
+test_evaluator = dict(type='ReportVQA', file_path='vqa_test-dev.json')
+
+# schedule settings
+val_cfg = dict()
+test_cfg = dict()
diff --git a/configs/flamingo/metafile.yml b/configs/flamingo/metafile.yml
new file mode 100644
index 0000000000000000000000000000000000000000..6ff33e93b24ce1e10efb57c7465e9e6663709f97
--- /dev/null
+++ b/configs/flamingo/metafile.yml
@@ -0,0 +1,42 @@
+Collections:
+ - Name: Flamingo
+ Metadata:
+ Architecture:
+ - Transformer
+ - Gated Cross-Attention Dense
+ Paper:
+ Title: 'Flamingo: a Visual Language Model for Few-Shot Learning'
+ URL: https://arxiv.org/abs/2204.14198
+ README: configs/flamingo/README.md
+
+Models:
+ - Name: flamingo_3rdparty-zeroshot_caption
+ Metadata:
+ FLOPs: null
+ Parameters: 8220452880
+ In Collection: Flamingo
+ Results:
+ - Task: Image Caption
+ Dataset: COCO
+ Metrics:
+ CIDER: 65.50 # Report from the official repo
+ Weights: https://download.openmmlab.com/mmclassification/v1/flamingo/openflamingo-9b-adapter_20230505-554310c8.pth
+ Config: configs/flamingo/flamingo_zeroshot_caption.py
+ Converted From:
+ Weights: https://huggingface.co/openflamingo/OpenFlamingo-9B
+ Code: https://github.com/mlfoundations/open_flamingo
+ - Name: flamingo_3rdparty-zeroshot_vqa
+ Metadata:
+ FLOPs: null
+ Parameters: 8220452880
+ In Collection: Flamingo
+ Results:
+ - Task: Visual Question Answering
+ Dataset: VQAv2
+ Metrics:
+ Accuracy: 43.50 # Report from the official repo
+ Weights: https://download.openmmlab.com/mmclassification/v1/flamingo/openflamingo-9b-adapter_20230505-554310c8.pth
+ Config: configs/flamingo/flamingo_zeroshot_vqa.py
+ Converted From:
+ Weights: https://huggingface.co/openflamingo/OpenFlamingo-9B
+ Code: https://github.com/mlfoundations/open_flamingo
diff --git a/configs/glip/README.md b/configs/glip/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..48ee30560a92b8ce3c926f536f625b67cca957c2
--- /dev/null
+++ b/configs/glip/README.md
@@ -0,0 +1,57 @@
+# GLIP
+
+> [Grounded Language-Image Pre-training](https://arxiv.org/abs/2112.03857)
+
+
+
+## Abstract
+
+This paper presents a grounded language-image pre-training (GLIP) model for learning object-level, language-aware, and semantic-rich visual representations. GLIP unifies object detection and phrase grounding for pre-training. The unification brings two benefits: 1) it allows GLIP to learn from both detection and grounding data to improve both tasks and bootstrap a good grounding model; 2) GLIP can leverage massive image-text pairs by generating grounding boxes in a self-training fashion, making the learned representation semantic-rich. In our experiments, we pre-train GLIP on 27M grounding data, including 3M human-annotated and 24M web-crawled image-text pairs. The learned representations demonstrate strong zero-shot and few-shot transferability to various object-level recognition tasks. 1) When directly evaluated on COCO and LVIS (without seeing any images in COCO during pre-training), GLIP achieves 49.8 AP and 26.9 AP, respectively, surpassing many supervised baselines. 2) After fine-tuned on COCO, GLIP achieves 60.8 AP on val and 61.5 AP on test-dev, surpassing prior SoTA. 3) When transferred to 13 downstream object detection tasks, a 1-shot GLIP rivals with a fully-supervised Dynamic Head.
+
+
+

+
+
+## How to use it?
+
+
+
+**Use the model**
+
+```python
+import torch
+from mmpretrain import get_model
+model = get_model('swin-t_glip-pre_3rdparty', pretrained=True)
+inputs = torch.rand(1, 3, 224, 224)
+out = model(inputs)
+print(type(out))
+# To extract features.
+feats = model.extract_feat(inputs)
+print(type(feats))
+```
+
+
+
+## Results and models
+
+### Pre-trained models
+
+The pre-trained models are only used for fine-tuning, and therefore don't have evaluation results.
+
+| Model | Pretrain | resolution | Download |
+| :------------------------------------------ | :------------------------: | :--------: | :-------------------------------------------------------------------------------------------------------------------: |
+| GLIP-T (`swin-t_glip-pre_3rdparty`)\* | O365,GoldG,CC3M,SBU | 224x224 | [model](https://download.openmmlab.com/mmclassification/v1/glip/swin-t_glip-pre_3rdparty_20230413-d85813b5.pth) |
+| GLIP-L (`swin-l_glip-pre_3rdparty_384px`)\* | FourODs,GoldG,CC3M+12M,SBU | 384x384 | [model](https://download.openmmlab.com/mmclassification/v1/glip/swin-l_glip-pre_3rdparty_384px_20230413-04b198e8.pth) |
+
+*Models with * are converted from the [official repo](https://github.com/microsoft/GLIP).*
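+
+To fine-tune from one of these checkpoints, the released weights can be loaded into the backbone of a new classifier through `init_cfg`. The config below is only a minimal sketch: the `_base_` files, head, and number of classes are illustrative choices rather than an official recipe, and it assumes the converted checkpoint stores the backbone weights under the `backbone.` prefix, as mmpretrain classifier checkpoints normally do.
+
+```python
+_base_ = [
+    '../_base_/datasets/imagenet_bs32_pil_resize.py',
+    '../_base_/schedules/imagenet_bs256_coslr.py',
+    '../_base_/default_runtime.py',
+]
+
+glip_t_ckpt = 'https://download.openmmlab.com/mmclassification/v1/glip/swin-t_glip-pre_3rdparty_20230413-d85813b5.pth'  # noqa: E501
+
+model = dict(
+    type='ImageClassifier',
+    backbone=dict(
+        type='SwinTransformer',
+        arch='tiny',
+        img_size=224,
+        # initialize the backbone from the GLIP-T pre-trained weights
+        init_cfg=dict(type='Pretrained', checkpoint=glip_t_ckpt, prefix='backbone')),
+    neck=dict(type='GlobalAveragePooling'),
+    head=dict(
+        type='LinearClsHead',
+        num_classes=1000,
+        in_channels=768,
+        loss=dict(type='CrossEntropyLoss', loss_weight=1.0)))
+```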
+
+## Citation
+
+```bibtex
+@inproceedings{li2021grounded,
+ title={Grounded Language-Image Pre-training},
+ author={Liunian Harold Li* and Pengchuan Zhang* and Haotian Zhang* and Jianwei Yang and Chunyuan Li and Yiwu Zhong and Lijuan Wang and Lu Yuan and Lei Zhang and Jenq-Neng Hwang and Kai-Wei Chang and Jianfeng Gao},
+ year={2022},
+ booktitle={CVPR},
+}
+```
diff --git a/configs/glip/glip-l_headless.py b/configs/glip/glip-l_headless.py
new file mode 100644
index 0000000000000000000000000000000000000000..991b6b85039bf0d24237a617dfeae285f97d7555
--- /dev/null
+++ b/configs/glip/glip-l_headless.py
@@ -0,0 +1,18 @@
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='SwinTransformer',
+ arch='large',
+ img_size=384,
+ out_indices=(1, 2, 3), # original weight is for detection
+ stage_cfgs=dict(block_cfgs=dict(window_size=12))),
+ neck=None,
+ head=None)
+
+data_preprocessor = dict(
+ # BGR format normalization parameters
+ mean=[103.53, 116.28, 123.675],
+ std=[57.375, 57.12, 58.395],
+ # keep images in BGR order, do not convert to RGB
+ to_rgb=False,
+)
diff --git a/configs/glip/glip-t_headless.py b/configs/glip/glip-t_headless.py
new file mode 100644
index 0000000000000000000000000000000000000000..08b89f8f1e02a1d1fa230e437e6b6e3ac873821f
--- /dev/null
+++ b/configs/glip/glip-t_headless.py
@@ -0,0 +1,18 @@
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='SwinTransformer',
+ arch='tiny',
+ img_size=224,
+ out_indices=(1, 2, 3), # original weight is for detection
+ ),
+ neck=None,
+ head=None)
+
+data_preprocessor = dict(
+ # BGR format normalization parameters
+ mean=[103.53, 116.28, 123.675],
+ std=[57.375, 57.12, 58.395],
+ # keep images in BGR order, do not convert to RGB
+ to_rgb=False,
+)
diff --git a/configs/glip/metafile.yml b/configs/glip/metafile.yml
new file mode 100644
index 0000000000000000000000000000000000000000..0691fd0d06c184082718be80d110a52dd9fae06b
--- /dev/null
+++ b/configs/glip/metafile.yml
@@ -0,0 +1,49 @@
+Collections:
+ - Name: GLIP
+ Metadata:
+ Training Techniques:
+ - AdamW
+ - Weight Decay
+ Architecture:
+ - Shift Window Multihead Self Attention
+ Paper:
+ URL: https://arxiv.org/abs/2112.03857
+ Title: "Grounded Language-Image Pre-training"
+ README: configs/glip/README.md
+ Code:
+ URL: https://github.com/open-mmlab/mmpretrain/blob/main/mmpretrain/models/backbones/swin_transformer.py
+ Version: v1.0.0rc8
+
+Models:
+ - Name: swin-t_glip-pre_3rdparty
+ In Collection: GLIP
+ Metadata:
+ FLOPs: 4508464128
+ Parameters: 29056354
+ Training Data:
+ - O365
+ - GoldG
+ - CC3M
+ - SBU
+ Results: null
+ Weights: https://download.openmmlab.com/mmclassification/v1/glip/swin-t_glip-pre_3rdparty_20230413-d85813b5.pth
+ Converted From:
+ Weights: https://penzhanwu2bbs.blob.core.windows.net/data/GLIPv1_Open/models/glip_tiny_model_o365_goldg_cc_sbu.pth
+ Code: https://github.com/microsoft/GLIP
+ Config: configs/glip/glip-t_headless.py
+ - Name: swin-l_glip-pre_3rdparty_384px
+ In Collection: GLIP
+ Metadata:
+ FLOPs: 104080343040
+ Parameters: 196735516
+ Training Data:
+ - FourODs
+ - GoldG
+ - CC3M+12M
+ - SBU
+ Results: null
+ Weights: https://download.openmmlab.com/mmclassification/v1/glip/swin-l_glip-pre_3rdparty_384px_20230413-04b198e8.pth
+ Converted From:
+ Weights: https://penzhanwu2bbs.blob.core.windows.net/data/GLIPv1_Open/models/glip_large_model.pth
+ Code: https://github.com/microsoft/GLIP
+ Config: configs/glip/glip-l_headless.py
diff --git a/configs/hivit/README.md b/configs/hivit/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..18ae0862c5db52a7f6f82451d398ee3e47d709ce
--- /dev/null
+++ b/configs/hivit/README.md
@@ -0,0 +1,81 @@
+# HiViT
+
+> [HiViT: A Simple and More Efficient Design of Hierarchical Vision Transformer](https://arxiv.org/abs/2205.14949)
+
+
+
+## Abstract
+
+Recently, masked image modeling (MIM) has offered a new methodology of self-supervised pre-training of vision transformers. A key idea of efficient implementation is to discard the masked image patches (or tokens) throughout the target network (encoder), which requires the encoder to be a plain vision transformer (e.g., ViT), albeit hierarchical vision transformers (e.g., Swin Transformer) have potentially better properties in formulating vision inputs. In this paper, we offer a new design of hierarchical vision transformers named HiViT (short for Hierarchical ViT) that enjoys both high efficiency and good performance in MIM. The key is to remove the unnecessary "local inter-unit operations", deriving structurally simple hierarchical vision transformers in which mask-units can be serialized like plain vision transformers. For this purpose, we start with Swin Transformer and (i) set the masking unit size to be the token size in the main stage of Swin Transformer, (ii) switch off inter-unit self-attentions before the main stage, and (iii) eliminate all operations after the main stage. Empirical studies demonstrate the advantageous performance of HiViT in terms of fully-supervised, self-supervised, and transfer learning. In particular, in running MAE on ImageNet-1K, HiViT-B reports a +0.6% accuracy gain over ViT-B and a 1.9$\times$ speed-up over Swin-B, and the performance gain generalizes to downstream tasks of detection and segmentation. Code will be made publicly available.
+
+
+

+
+
+## How to use it?
+
+
+
+
+
+
+
+**Train/Test Command**
+
+Prepare your dataset according to the [docs](https://mmpretrain.readthedocs.io/en/latest/user_guides/dataset_prepare.html#prepare-dataset).
+
+Train:
+
+```shell
+python tools/train.py configs/hivit/hivit-tiny-p16_16xb64_in1k.py
+```
+
+
+
+
+
+## Models and results
+
+### Image Classification on ImageNet-1k
+
+| Model | Pretrain | Params (M) | Flops (G) | Top-1 (%) | Config | Download |
+| :---------------------------- | :----------: | :--------: | :-------: | :-------: | :--------------------------------------: | :------: |
+| `hivit-tiny-p16_16xb64_in1k` | From scratch | 19.18 | 4.60 | 82.10 | [config](hivit-tiny-p16_16xb64_in1k.py) | N/A |
+| `hivit-small-p16_16xb64_in1k` | From scratch | 37.53 | 9.07 | N/A | [config](hivit-small-p16_16xb64_in1k.py) | N/A |
+| `hivit-base-p16_16xb64_in1k` | From scratch | 79.05 | 18.47 | N/A | [config](hivit-base-p16_16xb64_in1k.py) | N/A |
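+
+No pre-trained weights are released for these configs yet (the download links above are N/A), but the architectures can still be built and inspected directly from the config files. The snippet below is a minimal sketch, assuming `get_model` accepts a config file path and the command is run from the repository root.
+
+```python
+import torch
+
+from mmpretrain import get_model
+
+# Build HiViT-Tiny from its config; no checkpoint is loaded.
+model = get_model('configs/hivit/hivit-tiny-p16_16xb64_in1k.py')
+model.eval()
+
+inputs = torch.rand(1, 3, 224, 224)
+with torch.no_grad():
+    feats = model.extract_feat(inputs)
+print(type(feats))
+
+# Rough parameter count, to compare with the table above.
+print(sum(p.numel() for p in model.parameters()) / 1e6, 'M')
+```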
+
+## Citation
+
+```bibtex
+@inproceedings{zhanghivit,
+ title={HiViT: A Simpler and More Efficient Design of Hierarchical Vision Transformer},
+ author={Zhang, Xiaosong and Tian, Yunjie and Xie, Lingxi and Huang, Wei and Dai, Qi and Ye, Qixiang and Tian, Qi},
+ booktitle={International Conference on Learning Representations},
+ year={2023},
+}
+```
diff --git a/configs/hivit/hivit-base-p16_16xb64_in1k.py b/configs/hivit/hivit-base-p16_16xb64_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..d37dcda86ba8db69cea47477f240e24564fcf91f
--- /dev/null
+++ b/configs/hivit/hivit-base-p16_16xb64_in1k.py
@@ -0,0 +1,9 @@
+_base_ = [
+ '../_base_/models/hivit/base_224.py',
+ '../_base_/datasets/imagenet_bs64_hivit_224.py',
+ '../_base_/schedules/imagenet_bs1024_adamw_hivit.py',
+ '../_base_/default_runtime.py'
+]
+
+# schedule settings
+optim_wrapper = dict(clip_grad=dict(max_norm=5.0))
diff --git a/configs/hivit/hivit-small-p16_16xb64_in1k.py b/configs/hivit/hivit-small-p16_16xb64_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..4fa3976672e839354c8a215ded9a02874ab78aca
--- /dev/null
+++ b/configs/hivit/hivit-small-p16_16xb64_in1k.py
@@ -0,0 +1,9 @@
+_base_ = [
+ '../_base_/models/hivit/small_224.py',
+ '../_base_/datasets/imagenet_bs64_hivit_224.py',
+ '../_base_/schedules/imagenet_bs1024_adamw_hivit.py',
+ '../_base_/default_runtime.py'
+]
+
+# schedule settings
+optim_wrapper = dict(clip_grad=dict(max_norm=5.0))
diff --git a/configs/hivit/hivit-tiny-p16_16xb64_in1k.py b/configs/hivit/hivit-tiny-p16_16xb64_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..4ed3b6a7ae95a232995c50d26002fd6d5aa0fbe1
--- /dev/null
+++ b/configs/hivit/hivit-tiny-p16_16xb64_in1k.py
@@ -0,0 +1,9 @@
+_base_ = [
+ '../_base_/models/hivit/tiny_224.py',
+ '../_base_/datasets/imagenet_bs64_hivit_224.py',
+ '../_base_/schedules/imagenet_bs1024_adamw_hivit.py',
+ '../_base_/default_runtime.py'
+]
+
+# schedule settings
+optim_wrapper = dict(clip_grad=dict(max_norm=5.0))
diff --git a/configs/hivit/metafile.yml b/configs/hivit/metafile.yml
new file mode 100644
index 0000000000000000000000000000000000000000..67f3a6961637a1a43f64063bdcdd567c163ab3df
--- /dev/null
+++ b/configs/hivit/metafile.yml
@@ -0,0 +1,63 @@
+Collections:
+ - Name: HiViT
+ Metadata:
+ Architecture:
+ - Dense Connections
+ - Dropout
+ - GELU
+ - Layer Normalization
+ - Multi-Head Attention
+ - Scaled Dot-Product Attention
+ Paper:
+ Title: 'HiViT: A Simple and More Efficient Design of Hierarchical Vision Transformer'
+ URL: https://arxiv.org/abs/2205.14949
+ README: configs/hivit/README.md
+ Code:
+ URL: null
+ Version: null
+
+Models:
+ - Name: hivit-tiny-p16_16xb64_in1k
+ Metadata:
+ FLOPs: 4603000000
+ Parameters: 19181000
+ Training Data:
+ - ImageNet-1k
+ In Collection: HiViT
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 82.1
+ Task: Image Classification
+ Weights:
+ Config: configs/hivit/hivit-tiny-p16_16xb64_in1k.py
+
+ - Name: hivit-small-p16_16xb64_in1k
+ Metadata:
+ FLOPs: 9072000000
+ Parameters: 37526000
+ Training Data:
+ - ImageNet-1k
+ In Collection: HiViT
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy:
+ Task: Image Classification
+ Weights:
+ Config: configs/hivit/hivit-small-p16_16xb64_in1k.py
+
+ - Name: hivit-base-p16_16xb64_in1k
+ Metadata:
+ FLOPs: 18474000000
+ Parameters: 79051000
+ Training Data:
+ - ImageNet-1k
+ In Collection: HiViT
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy:
+ Task: Image Classification
+ Weights:
+ Config: configs/hivit/hivit-base-p16_16xb64_in1k.py
diff --git a/configs/hornet/README.md b/configs/hornet/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..b4dbf05bd35d4cfc0fc165ea857110e18ace664c
--- /dev/null
+++ b/configs/hornet/README.md
@@ -0,0 +1,80 @@
+# HorNet
+
+> [HorNet: Efficient High-Order Spatial Interactions with Recursive Gated Convolutions](https://arxiv.org/abs/2207.14284)
+
+
+
+## Abstract
+
+Recent progress in vision Transformers exhibits great success in various tasks driven by the new spatial modeling mechanism based on dot-product self-attention. In this paper, we show that the key ingredients behind the vision Transformers, namely input-adaptive, long-range and high-order spatial interactions, can also be efficiently implemented with a convolution-based framework. We present the Recursive Gated Convolution (gnConv) that performs high-order spatial interactions with gated convolutions and recursive designs. The new operation is highly flexible and customizable, which is compatible with various variants of convolution and extends the two-order interactions in self-attention to arbitrary orders without introducing significant extra computation. gnConv can serve as a plug-and-play module to improve various vision Transformers and convolution-based models. Based on the operation, we construct a new family of generic vision backbones named HorNet. Extensive experiments on ImageNet classification, COCO object detection and ADE20K semantic segmentation show HorNet outperforms Swin Transformers and ConvNeXt by a significant margin with similar overall architecture and training configurations. HorNet also shows favorable scalability to more training data and a larger model size. Apart from the effectiveness in visual encoders, we also show gnConv can be applied to task-specific decoders and consistently improve dense prediction performance with less computation. Our results demonstrate that gnConv can be a new basic module for visual modeling that effectively combines the merits of both vision Transformers and CNNs. Code is available at https://github.com/raoyongming/HorNet.
+
+
+

+
+
+## How to use it?
+
+
+
+**Predict image**
+
+```python
+from mmpretrain import inference_model
+
+predict = inference_model('hornet-tiny_3rdparty_in1k', 'demo/bird.JPEG')
+print(predict['pred_class'])
+print(predict['pred_score'])
+```
+
+**Use the model**
+
+```python
+import torch
+from mmpretrain import get_model
+
+model = get_model('hornet-tiny_3rdparty_in1k', pretrained=True)
+inputs = torch.rand(1, 3, 224, 224)
+out = model(inputs)
+print(type(out))
+# To extract features.
+feats = model.extract_feat(inputs)
+print(type(feats))
+```
+
+**Test Command**
+
+Prepare your dataset according to the [docs](https://mmpretrain.readthedocs.io/en/latest/user_guides/dataset_prepare.html#prepare-dataset).
+
+Test:
+
+```shell
+python tools/test.py configs/hornet/hornet-tiny_8xb128_in1k.py https://download.openmmlab.com/mmclassification/v0/hornet/hornet-tiny_3rdparty_in1k_20220915-0e8eedff.pth
+```
+
+
+
+## Models and results
+
+### Image Classification on ImageNet-1k
+
+| Model | Pretrain | Params (M) | Flops (G) | Top-1 (%) | Top-5 (%) | Config | Download |
+| :-------------------------------- | :----------: | :--------: | :-------: | :-------: | :-------: | :-------------------------------------: | :-----------------------------------------------------------------------------: |
+| `hornet-tiny_3rdparty_in1k`\* | From scratch | 22.41 | 3.98 | 82.84 | 96.24 | [config](hornet-tiny_8xb128_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/hornet/hornet-tiny_3rdparty_in1k_20220915-0e8eedff.pth) |
+| `hornet-tiny-gf_3rdparty_in1k`\* | From scratch | 22.99 | 3.90 | 82.98 | 96.38 | [config](hornet-tiny-gf_8xb128_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/hornet/hornet-tiny-gf_3rdparty_in1k_20220915-4c35a66b.pth) |
+| `hornet-small_3rdparty_in1k`\* | From scratch | 49.53 | 8.83 | 83.79 | 96.75 | [config](hornet-small_8xb64_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/hornet/hornet-small_3rdparty_in1k_20220915-5935f60f.pth) |
+| `hornet-small-gf_3rdparty_in1k`\* | From scratch | 50.40 | 8.71 | 83.98 | 96.77 | [config](hornet-small-gf_8xb64_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/hornet/hornet-small-gf_3rdparty_in1k_20220915-649ca492.pth) |
+| `hornet-base_3rdparty_in1k`\* | From scratch | 87.26 | 15.58 | 84.24 | 96.94 | [config](hornet-base_8xb64_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/hornet/hornet-base_3rdparty_in1k_20220915-a06176bb.pth) |
+| `hornet-base-gf_3rdparty_in1k`\* | From scratch | 88.42 | 15.42 | 84.32 | 96.95 | [config](hornet-base-gf_8xb64_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/hornet/hornet-base-gf_3rdparty_in1k_20220915-82c06fa7.pth) |
+
+*Models with * are converted from the [official repo](https://github.com/raoyongming/HorNet). The config files of these models are only for inference. We haven't reproduced the training results.*
+
+## Citation
+
+```bibtex
+@article{rao2022hornet,
+ title={HorNet: Efficient High-Order Spatial Interactions with Recursive Gated Convolutions},
+ author={Rao, Yongming and Zhao, Wenliang and Tang, Yansong and Zhou, Jie and Lim, Ser-Nam and Lu, Jiwen},
+ journal={arXiv preprint arXiv:2207.14284},
+ year={2022}
+}
+```
diff --git a/configs/hornet/hornet-base-gf_8xb64_in1k.py b/configs/hornet/hornet-base-gf_8xb64_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..b27012df51b4bc90303d5c30df83fb24a2d76690
--- /dev/null
+++ b/configs/hornet/hornet-base-gf_8xb64_in1k.py
@@ -0,0 +1,12 @@
+_base_ = [
+ '../_base_/models/hornet/hornet-base-gf.py',
+ '../_base_/datasets/imagenet_bs64_swin_224.py',
+ '../_base_/schedules/imagenet_bs1024_adamw_swin.py',
+ '../_base_/default_runtime.py',
+]
+
+data = dict(samples_per_gpu=64)
+
+optim_wrapper = dict(optimizer=dict(lr=4e-3), clip_grad=dict(max_norm=1.0))
+
+custom_hooks = [dict(type='EMAHook', momentum=4e-5, priority='ABOVE_NORMAL')]
diff --git a/configs/hornet/hornet-base_8xb64_in1k.py b/configs/hornet/hornet-base_8xb64_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..cb78a7ddaac26bcde4032c8342de251c3c26fb68
--- /dev/null
+++ b/configs/hornet/hornet-base_8xb64_in1k.py
@@ -0,0 +1,12 @@
+_base_ = [
+ '../_base_/models/hornet/hornet-base.py',
+ '../_base_/datasets/imagenet_bs64_swin_224.py',
+ '../_base_/schedules/imagenet_bs1024_adamw_swin.py',
+ '../_base_/default_runtime.py',
+]
+
+data = dict(samples_per_gpu=64)
+
+optim_wrapper = dict(optimizer=dict(lr=4e-3), clip_grad=dict(max_norm=5.0))
+
+custom_hooks = [dict(type='EMAHook', momentum=4e-5, priority='ABOVE_NORMAL')]
diff --git a/configs/hornet/hornet-small-gf_8xb64_in1k.py b/configs/hornet/hornet-small-gf_8xb64_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..96fcc77d8ca1f693f479f795e97469240f4632c3
--- /dev/null
+++ b/configs/hornet/hornet-small-gf_8xb64_in1k.py
@@ -0,0 +1,12 @@
+_base_ = [
+ '../_base_/models/hornet/hornet-small-gf.py',
+ '../_base_/datasets/imagenet_bs64_swin_224.py',
+ '../_base_/schedules/imagenet_bs1024_adamw_swin.py',
+ '../_base_/default_runtime.py',
+]
+
+data = dict(samples_per_gpu=64)
+
+optim_wrapper = dict(optimizer=dict(lr=4e-3), clip_grad=dict(max_norm=1.0))
+
+custom_hooks = [dict(type='EMAHook', momentum=4e-5, priority='ABOVE_NORMAL')]
diff --git a/configs/hornet/hornet-small_8xb64_in1k.py b/configs/hornet/hornet-small_8xb64_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..f0535ade00cdff0c4a25e6570a1316216f6fd37b
--- /dev/null
+++ b/configs/hornet/hornet-small_8xb64_in1k.py
@@ -0,0 +1,12 @@
+_base_ = [
+ '../_base_/models/hornet/hornet-small.py',
+ '../_base_/datasets/imagenet_bs64_swin_224.py',
+ '../_base_/schedules/imagenet_bs1024_adamw_swin.py',
+ '../_base_/default_runtime.py',
+]
+
+data = dict(samples_per_gpu=64)
+
+optim_wrapper = dict(optimizer=dict(lr=4e-3), clip_grad=dict(max_norm=5.0))
+
+custom_hooks = [dict(type='EMAHook', momentum=4e-5, priority='ABOVE_NORMAL')]
diff --git a/configs/hornet/hornet-tiny-gf_8xb128_in1k.py b/configs/hornet/hornet-tiny-gf_8xb128_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..3556de9c15ccb29b98fe1a7b68ee59cbbf320536
--- /dev/null
+++ b/configs/hornet/hornet-tiny-gf_8xb128_in1k.py
@@ -0,0 +1,12 @@
+_base_ = [
+ '../_base_/models/hornet/hornet-tiny-gf.py',
+ '../_base_/datasets/imagenet_bs64_swin_224.py',
+ '../_base_/schedules/imagenet_bs1024_adamw_swin.py',
+ '../_base_/default_runtime.py',
+]
+
+data = dict(samples_per_gpu=128)
+
+optim_wrapper = dict(optimizer=dict(lr=4e-3), clip_grad=dict(max_norm=1.0))
+
+custom_hooks = [dict(type='EMAHook', momentum=4e-5, priority='ABOVE_NORMAL')]
diff --git a/configs/hornet/hornet-tiny_8xb128_in1k.py b/configs/hornet/hornet-tiny_8xb128_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..31bd1dd3fc9c4918c3043916fc155f9eb7faad1d
--- /dev/null
+++ b/configs/hornet/hornet-tiny_8xb128_in1k.py
@@ -0,0 +1,12 @@
+_base_ = [
+ '../_base_/models/hornet/hornet-tiny.py',
+ '../_base_/datasets/imagenet_bs64_swin_224.py',
+ '../_base_/schedules/imagenet_bs1024_adamw_swin.py',
+ '../_base_/default_runtime.py',
+]
+
+data = dict(samples_per_gpu=128)
+
+optim_wrapper = dict(optimizer=dict(lr=4e-3), clip_grad=dict(max_norm=100.0))
+
+custom_hooks = [dict(type='EMAHook', momentum=4e-5, priority='ABOVE_NORMAL')]
diff --git a/configs/hornet/metafile.yml b/configs/hornet/metafile.yml
new file mode 100644
index 0000000000000000000000000000000000000000..eba0ed2f4c9ac8eb758f5f5a81d023440ae53484
--- /dev/null
+++ b/configs/hornet/metafile.yml
@@ -0,0 +1,115 @@
+Collections:
+ - Name: HorNet
+ Metadata:
+ Training Data: ImageNet-1k
+ Training Techniques:
+ - AdamW
+ - Weight Decay
+ Architecture:
+ - HorNet
+ - gnConv
+ Paper:
+ URL: https://arxiv.org/abs/2207.14284
+ Title: "HorNet: Efficient High-Order Spatial Interactions with Recursive Gated Convolutions"
+ README: configs/hornet/README.md
+ Code:
+ Version: v0.24.0
+ URL: https://github.com/open-mmlab/mmpretrain/blob/v0.24.0/mmcls/models/backbones/hornet.py
+
+Models:
+ - Name: hornet-tiny_3rdparty_in1k
+ Metadata:
+ FLOPs: 3976156352 # 3.98G
+ Parameters: 22409512 # 22.41M
+ In Collection: HorNet
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 82.84
+ Top 5 Accuracy: 96.24
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/hornet/hornet-tiny_3rdparty_in1k_20220915-0e8eedff.pth
+ Config: configs/hornet/hornet-tiny_8xb128_in1k.py
+ Converted From:
+ Code: https://github.com/raoyongming/HorNet
+ Weights: https://cloud.tsinghua.edu.cn/f/1ca970586c6043709a3f/?dl=1
+ - Name: hornet-tiny-gf_3rdparty_in1k
+ Metadata:
+ FLOPs: 3896472160 # 3.9G
+ Parameters: 22991848 # 22.99M
+ In Collection: HorNet
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 82.98
+ Top 5 Accuracy: 96.38
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/hornet/hornet-tiny-gf_3rdparty_in1k_20220915-4c35a66b.pth
+ Config: configs/hornet/hornet-tiny-gf_8xb128_in1k.py
+ Converted From:
+ Code: https://github.com/raoyongming/HorNet
+ Weights: https://cloud.tsinghua.edu.cn/f/511faad0bde94dfcaa54/?dl=1
+ - Name: hornet-small_3rdparty_in1k
+ Metadata:
+ FLOPs: 8825621280 # 8.83G
+ Parameters: 49528264 # 49.53M
+ In Collection: HorNet
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 83.79
+ Top 5 Accuracy: 96.75
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/hornet/hornet-small_3rdparty_in1k_20220915-5935f60f.pth
+ Config: configs/hornet/hornet-small_8xb64_in1k.py
+ Converted From:
+ Code: https://github.com/raoyongming/HorNet
+ Weights: https://cloud.tsinghua.edu.cn/f/46422799db2941f7b684/?dl=1
+ - Name: hornet-small-gf_3rdparty_in1k
+ Metadata:
+ FLOPs: 8706094992 # 8.71G
+ Parameters: 50401768 # 50.4M
+ In Collection: HorNet
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 83.98
+ Top 5 Accuracy: 96.77
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/hornet/hornet-small-gf_3rdparty_in1k_20220915-649ca492.pth
+ Config: configs/hornet/hornet-small-gf_8xb64_in1k.py
+ Converted From:
+ Code: https://github.com/raoyongming/HorNet
+ Weights: https://cloud.tsinghua.edu.cn/f/8405c984bf084d2ba85a/?dl=1
+ - Name: hornet-base_3rdparty_in1k
+ Metadata:
+ FLOPs: 15582677376 # 15.59G
+ Parameters: 87256680 # 87.26M
+ In Collection: HorNet
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 84.24
+ Top 5 Accuracy: 96.94
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/hornet/hornet-base_3rdparty_in1k_20220915-a06176bb.pth
+ Config: configs/hornet/hornet-base_8xb64_in1k.py
+ Converted From:
+ Code: https://github.com/raoyongming/HorNet
+ Weights: https://cloud.tsinghua.edu.cn/f/5c86cb3d655d4c17a959/?dl=1
+ - Name: hornet-base-gf_3rdparty_in1k
+ Metadata:
+ FLOPs: 15423308992 # 15.42G
+ Parameters: 88421352 # 88.42M
+ In Collection: HorNet
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 84.32
+ Top 5 Accuracy: 96.95
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/hornet/hornet-base-gf_3rdparty_in1k_20220915-82c06fa7.pth
+ Config: configs/hornet/hornet-base-gf_8xb64_in1k.py
+ Converted From:
+ Code: https://github.com/raoyongming/HorNet
+ Weights: https://cloud.tsinghua.edu.cn/f/6c84935e63b547f383fb/?dl=1
diff --git a/configs/hrnet/README.md b/configs/hrnet/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..31725cf8a4e062552fcc7a0be60562885944924c
--- /dev/null
+++ b/configs/hrnet/README.md
@@ -0,0 +1,85 @@
+# HRNet
+
+> [Deep High-Resolution Representation Learning for Visual Recognition](https://arxiv.org/abs/1908.07919v2)
+
+
+
+## Abstract
+
+High-resolution representations are essential for position-sensitive vision problems, such as human pose estimation, semantic segmentation, and object detection. Existing state-of-the-art frameworks first encode the input image as a low-resolution representation through a subnetwork that is formed by connecting high-to-low resolution convolutions *in series* (e.g., ResNet, VGGNet), and then recover the high-resolution representation from the encoded low-resolution representation. Instead, our proposed network, named as High-Resolution Network (HRNet), maintains high-resolution representations through the whole process. There are two key characteristics: (i) Connect the high-to-low resolution convolution streams *in parallel*; (ii) Repeatedly exchange the information across resolutions. The benefit is that the resulting representation is semantically richer and spatially more precise. We show the superiority of the proposed HRNet in a wide range of applications, including human pose estimation, semantic segmentation, and object detection, suggesting that the HRNet is a stronger backbone for computer vision problems.
+
+
+

+
+
+## How to use it?
+
+
+
+**Predict image**
+
+```python
+from mmpretrain import inference_model
+
+predict = inference_model('hrnet-w18_3rdparty_8xb32_in1k', 'demo/bird.JPEG')
+print(predict['pred_class'])
+print(predict['pred_score'])
+```
+
+**Use the model**
+
+```python
+import torch
+from mmpretrain import get_model
+
+model = get_model('hrnet-w18_3rdparty_8xb32_in1k', pretrained=True)
+inputs = torch.rand(1, 3, 224, 224)
+out = model(inputs)
+print(type(out))
+# To extract features.
+feats = model.extract_feat(inputs)
+print(type(feats))
+```
+
+**Test Command**
+
+Prepare your dataset according to the [docs](https://mmpretrain.readthedocs.io/en/latest/user_guides/dataset_prepare.html#prepare-dataset).
+
+Test:
+
+```shell
+python tools/test.py configs/hrnet/hrnet-w18_4xb32_in1k.py https://download.openmmlab.com/mmclassification/v0/hrnet/hrnet-w18_3rdparty_8xb32_in1k_20220120-0c10b180.pth
+```
+
+
+
+## Models and results
+
+### Image Classification on ImageNet-1k
+
+| Model | Pretrain | Params (M) | Flops (G) | Top-1 (%) | Top-5 (%) | Config | Download |
+| :------------------------------------- | :----------: | :--------: | :-------: | :-------: | :-------: | :-------------------------------: | :------------------------------------------------------------------------------: |
+| `hrnet-w18_3rdparty_8xb32_in1k`\* | From scratch | 21.30 | 4.33 | 76.75 | 93.44 | [config](hrnet-w18_4xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/hrnet/hrnet-w18_3rdparty_8xb32_in1k_20220120-0c10b180.pth) |
+| `hrnet-w30_3rdparty_8xb32_in1k`\* | From scratch | 37.71 | 8.17 | 78.19 | 94.22 | [config](hrnet-w30_4xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/hrnet/hrnet-w30_3rdparty_8xb32_in1k_20220120-8aa3832f.pth) |
+| `hrnet-w32_3rdparty_8xb32_in1k`\* | From scratch | 41.23 | 8.99 | 78.44 | 94.19 | [config](hrnet-w32_4xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/hrnet/hrnet-w32_3rdparty_8xb32_in1k_20220120-c394f1ab.pth) |
+| `hrnet-w40_3rdparty_8xb32_in1k`\* | From scratch | 57.55 | 12.77 | 78.94 | 94.47 | [config](hrnet-w40_4xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/hrnet/hrnet-w40_3rdparty_8xb32_in1k_20220120-9a2dbfc5.pth) |
+| `hrnet-w44_3rdparty_8xb32_in1k`\* | From scratch | 67.06 | 14.96 | 78.88 | 94.37 | [config](hrnet-w44_4xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/hrnet/hrnet-w44_3rdparty_8xb32_in1k_20220120-35d07f73.pth) |
+| `hrnet-w48_3rdparty_8xb32_in1k`\* | From scratch | 77.47 | 17.36 | 79.32 | 94.52 | [config](hrnet-w48_4xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/hrnet/hrnet-w48_3rdparty_8xb32_in1k_20220120-e555ef50.pth) |
+| `hrnet-w64_3rdparty_8xb32_in1k`\* | From scratch | 128.06 | 29.00 | 79.46 | 94.65 | [config](hrnet-w64_4xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/hrnet/hrnet-w64_3rdparty_8xb32_in1k_20220120-19126642.pth) |
+| `hrnet-w18_3rdparty_8xb32-ssld_in1k`\* | From scratch | 21.30 | 4.33 | 81.06 | 95.70 | [config](hrnet-w18_4xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/hrnet/hrnet-w18_3rdparty_8xb32-ssld_in1k_20220120-455f69ea.pth) |
+| `hrnet-w48_3rdparty_8xb32-ssld_in1k`\* | From scratch | 77.47 | 17.36 | 83.63 | 96.79 | [config](hrnet-w48_4xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/hrnet/hrnet-w48_3rdparty_8xb32-ssld_in1k_20220120-d0459c38.pth) |
+
+*Models with * are converted from the [official repo](https://github.com/HRNet/HRNet-Image-Classification). The config files of these models are only for inference. We haven't reproduced the training results.*
+
+## Citation
+
+```bibtex
+@article{WangSCJDZLMTWLX19,
+ title={Deep High-Resolution Representation Learning for Visual Recognition},
+ author={Jingdong Wang and Ke Sun and Tianheng Cheng and
+ Borui Jiang and Chaorui Deng and Yang Zhao and Dong Liu and Yadong Mu and
+ Mingkui Tan and Xinggang Wang and Wenyu Liu and Bin Xiao},
+ journal={TPAMI},
+ year={2019}
+}
+```
diff --git a/configs/hrnet/hrnet-w18_4xb32_in1k.py b/configs/hrnet/hrnet-w18_4xb32_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..3bc329a7e050131b01305d0209cc087c8f2daa24
--- /dev/null
+++ b/configs/hrnet/hrnet-w18_4xb32_in1k.py
@@ -0,0 +1,11 @@
+_base_ = [
+ '../_base_/models/hrnet/hrnet-w18.py',
+ '../_base_/datasets/imagenet_bs32_pil_resize.py',
+ '../_base_/schedules/imagenet_bs256_coslr.py',
+ '../_base_/default_runtime.py'
+]
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR
+# based on the actual training batch size.
+# base_batch_size = (4 GPUs) x (32 samples per GPU)
+auto_scale_lr = dict(base_batch_size=128)
diff --git a/configs/hrnet/hrnet-w30_4xb32_in1k.py b/configs/hrnet/hrnet-w30_4xb32_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..669a66b8cc7af8b8b394dba3f915f184e3b9d28f
--- /dev/null
+++ b/configs/hrnet/hrnet-w30_4xb32_in1k.py
@@ -0,0 +1,11 @@
+_base_ = [
+ '../_base_/models/hrnet/hrnet-w30.py',
+ '../_base_/datasets/imagenet_bs32_pil_resize.py',
+ '../_base_/schedules/imagenet_bs256_coslr.py',
+ '../_base_/default_runtime.py'
+]
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR
+# based on the actual training batch size.
+# base_batch_size = (4 GPUs) x (32 samples per GPU)
+auto_scale_lr = dict(base_batch_size=128)
diff --git a/configs/hrnet/hrnet-w32_4xb32_in1k.py b/configs/hrnet/hrnet-w32_4xb32_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..1e487403ffd242f4886962237a5bbfd57d6bbd62
--- /dev/null
+++ b/configs/hrnet/hrnet-w32_4xb32_in1k.py
@@ -0,0 +1,11 @@
+_base_ = [
+ '../_base_/models/hrnet/hrnet-w32.py',
+ '../_base_/datasets/imagenet_bs32_pil_resize.py',
+ '../_base_/schedules/imagenet_bs256_coslr.py',
+ '../_base_/default_runtime.py'
+]
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR
+# based on the actual training batch size.
+# base_batch_size = (4 GPUs) x (32 samples per GPU)
+auto_scale_lr = dict(base_batch_size=128)
diff --git a/configs/hrnet/hrnet-w40_4xb32_in1k.py b/configs/hrnet/hrnet-w40_4xb32_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..1866a2a2b93d49164ebc8892342d11781a1ba9a5
--- /dev/null
+++ b/configs/hrnet/hrnet-w40_4xb32_in1k.py
@@ -0,0 +1,11 @@
+_base_ = [
+ '../_base_/models/hrnet/hrnet-w40.py',
+ '../_base_/datasets/imagenet_bs32_pil_resize.py',
+ '../_base_/schedules/imagenet_bs256_coslr.py',
+ '../_base_/default_runtime.py'
+]
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR
+# based on the actual training batch size.
+# base_batch_size = (4 GPUs) x (32 samples per GPU)
+auto_scale_lr = dict(base_batch_size=128)
diff --git a/configs/hrnet/hrnet-w44_4xb32_in1k.py b/configs/hrnet/hrnet-w44_4xb32_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..4ec913f7188151ea913f7ba324dc31845b1e9c11
--- /dev/null
+++ b/configs/hrnet/hrnet-w44_4xb32_in1k.py
@@ -0,0 +1,11 @@
+_base_ = [
+ '../_base_/models/hrnet/hrnet-w44.py',
+ '../_base_/datasets/imagenet_bs32_pil_resize.py',
+ '../_base_/schedules/imagenet_bs256_coslr.py',
+ '../_base_/default_runtime.py'
+]
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR
+# based on the actual training batch size.
+# base_batch_size = (4 GPUs) x (32 samples per GPU)
+auto_scale_lr = dict(base_batch_size=128)
diff --git a/configs/hrnet/hrnet-w48_4xb32_in1k.py b/configs/hrnet/hrnet-w48_4xb32_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..0fc3f18ff03fafba4ff24d510546b6b0434c76c4
--- /dev/null
+++ b/configs/hrnet/hrnet-w48_4xb32_in1k.py
@@ -0,0 +1,11 @@
+_base_ = [
+ '../_base_/models/hrnet/hrnet-w48.py',
+ '../_base_/datasets/imagenet_bs32_pil_resize.py',
+ '../_base_/schedules/imagenet_bs256_coslr.py',
+ '../_base_/default_runtime.py'
+]
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR
+# based on the actual training batch size.
+# base_batch_size = (4 GPUs) x (32 samples per GPU)
+auto_scale_lr = dict(base_batch_size=128)
diff --git a/configs/hrnet/hrnet-w64_4xb32_in1k.py b/configs/hrnet/hrnet-w64_4xb32_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..659b3cd23ef16d953dc181d83016f955cd1570e0
--- /dev/null
+++ b/configs/hrnet/hrnet-w64_4xb32_in1k.py
@@ -0,0 +1,11 @@
+_base_ = [
+ '../_base_/models/hrnet/hrnet-w64.py',
+ '../_base_/datasets/imagenet_bs32_pil_resize.py',
+ '../_base_/schedules/imagenet_bs256_coslr.py',
+ '../_base_/default_runtime.py'
+]
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR
+# based on the actual training batch size.
+# base_batch_size = (4 GPUs) x (32 samples per GPU)
+auto_scale_lr = dict(base_batch_size=128)
diff --git a/configs/hrnet/metafile.yml b/configs/hrnet/metafile.yml
new file mode 100644
index 0000000000000000000000000000000000000000..3a17b1251333c17b3b1c7834b46d15b4c43b8bd3
--- /dev/null
+++ b/configs/hrnet/metafile.yml
@@ -0,0 +1,162 @@
+Collections:
+ - Name: HRNet
+ Metadata:
+ Training Data: ImageNet-1k
+ Architecture:
+ - Batch Normalization
+ - Convolution
+ - ReLU
+ - Residual Connection
+ Paper:
+ URL: https://arxiv.org/abs/1908.07919v2
+ Title: "Deep High-Resolution Representation Learning for Visual Recognition"
+ README: configs/hrnet/README.md
+ Code:
+ URL: https://github.com/open-mmlab/mmpretrain/blob/v0.20.1/mmcls/models/backbones/hrnet.py
+ Version: v0.20.1
+
+Models:
+ - Name: hrnet-w18_3rdparty_8xb32_in1k
+ Metadata:
+ FLOPs: 4330397932
+ Parameters: 21295164
+ In Collection: HRNet
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 76.75
+ Top 5 Accuracy: 93.44
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/hrnet/hrnet-w18_3rdparty_8xb32_in1k_20220120-0c10b180.pth
+ Config: configs/hrnet/hrnet-w18_4xb32_in1k.py
+ Converted From:
+ Weights: https://1drv.ms/u/s!Aus8VCZ_C_33cMkPimlmClRvmpw
+ Code: https://github.com/HRNet/HRNet-Image-Classification
+ - Name: hrnet-w30_3rdparty_8xb32_in1k
+ Metadata:
+ FLOPs: 8168305684
+ Parameters: 37708380
+ In Collection: HRNet
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 78.19
+ Top 5 Accuracy: 94.22
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/hrnet/hrnet-w30_3rdparty_8xb32_in1k_20220120-8aa3832f.pth
+ Config: configs/hrnet/hrnet-w30_4xb32_in1k.py
+ Converted From:
+ Weights: https://1drv.ms/u/s!Aus8VCZ_C_33cQoACCEfrzcSaVI
+ Code: https://github.com/HRNet/HRNet-Image-Classification
+ - Name: hrnet-w32_3rdparty_8xb32_in1k
+ Metadata:
+ FLOPs: 8986267584
+ Parameters: 41228840
+ In Collection: HRNet
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 78.44
+ Top 5 Accuracy: 94.19
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/hrnet/hrnet-w32_3rdparty_8xb32_in1k_20220120-c394f1ab.pth
+ Config: configs/hrnet/hrnet-w32_4xb32_in1k.py
+ Converted From:
+ Weights: https://1drv.ms/u/s!Aus8VCZ_C_33dYBMemi9xOUFR0w
+ Code: https://github.com/HRNet/HRNet-Image-Classification
+ - Name: hrnet-w40_3rdparty_8xb32_in1k
+ Metadata:
+ FLOPs: 12767574064
+ Parameters: 57553320
+ In Collection: HRNet
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 78.94
+ Top 5 Accuracy: 94.47
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/hrnet/hrnet-w40_3rdparty_8xb32_in1k_20220120-9a2dbfc5.pth
+ Config: configs/hrnet/hrnet-w40_4xb32_in1k.py
+ Converted From:
+ Weights: https://1drv.ms/u/s!Aus8VCZ_C_33ck0gvo5jfoWBOPo
+ Code: https://github.com/HRNet/HRNet-Image-Classification
+ - Name: hrnet-w44_3rdparty_8xb32_in1k
+ Metadata:
+ FLOPs: 14963902632
+ Parameters: 67061144
+ In Collection: HRNet
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 78.88
+ Top 5 Accuracy: 94.37
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/hrnet/hrnet-w44_3rdparty_8xb32_in1k_20220120-35d07f73.pth
+ Config: configs/hrnet/hrnet-w44_4xb32_in1k.py
+ Converted From:
+ Weights: https://1drv.ms/u/s!Aus8VCZ_C_33czZQ0woUb980gRs
+ Code: https://github.com/HRNet/HRNet-Image-Classification
+ - Name: hrnet-w48_3rdparty_8xb32_in1k
+ Metadata:
+ FLOPs: 17364014752
+ Parameters: 77466024
+ In Collection: HRNet
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 79.32
+ Top 5 Accuracy: 94.52
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/hrnet/hrnet-w48_3rdparty_8xb32_in1k_20220120-e555ef50.pth
+ Config: configs/hrnet/hrnet-w48_4xb32_in1k.py
+ Converted From:
+ Weights: https://1drv.ms/u/s!Aus8VCZ_C_33dKvqI6pBZlifgJk
+ Code: https://github.com/HRNet/HRNet-Image-Classification
+ - Name: hrnet-w64_3rdparty_8xb32_in1k
+ Metadata:
+ FLOPs: 29002298752
+ Parameters: 128056104
+ In Collection: HRNet
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 79.46
+ Top 5 Accuracy: 94.65
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/hrnet/hrnet-w64_3rdparty_8xb32_in1k_20220120-19126642.pth
+ Config: configs/hrnet/hrnet-w64_4xb32_in1k.py
+ Converted From:
+ Weights: https://1drv.ms/u/s!Aus8VCZ_C_33gQbJsUPTIj3rQu99
+ Code: https://github.com/HRNet/HRNet-Image-Classification
+ - Name: hrnet-w18_3rdparty_8xb32-ssld_in1k
+ Metadata:
+ FLOPs: 4330397932
+ Parameters: 21295164
+ In Collection: HRNet
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 81.06
+ Top 5 Accuracy: 95.7
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/hrnet/hrnet-w18_3rdparty_8xb32-ssld_in1k_20220120-455f69ea.pth
+ Config: configs/hrnet/hrnet-w18_4xb32_in1k.py
+ Converted From:
+ Weights: https://github.com/HRNet/HRNet-Image-Classification/releases/download/PretrainedWeights/HRNet_W18_C_ssld_pretrained.pth
+ Code: https://github.com/HRNet/HRNet-Image-Classification
+ - Name: hrnet-w48_3rdparty_8xb32-ssld_in1k
+ Metadata:
+ FLOPs: 17364014752
+ Parameters: 77466024
+ In Collection: HRNet
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 83.63
+ Top 5 Accuracy: 96.79
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/hrnet/hrnet-w48_3rdparty_8xb32-ssld_in1k_20220120-d0459c38.pth
+ Config: configs/hrnet/hrnet-w48_4xb32_in1k.py
+ Converted From:
+ Weights: https://github.com/HRNet/HRNet-Image-Classification/releases/download/PretrainedWeights/HRNet_W48_C_ssld_pretrained.pth
+ Code: https://github.com/HRNet/HRNet-Image-Classification
diff --git a/configs/inception_v3/README.md b/configs/inception_v3/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..24fde38118de66a642938d4d23f95ed5e5bfb412
--- /dev/null
+++ b/configs/inception_v3/README.md
@@ -0,0 +1,76 @@
+# Inception V3
+
+> [Rethinking the Inception Architecture for Computer Vision](http://arxiv.org/abs/1512.00567)
+
+
+
+## Abstract
+
+Convolutional networks are at the core of most state-of-the-art computer vision solutions for a wide variety of tasks. Since 2014 very deep convolutional networks started to become mainstream, yielding substantial gains in various benchmarks. Although increased model size and computational cost tend to translate to immediate quality gains for most tasks (as long as enough labeled data is provided for training), computational efficiency and low parameter count are still enabling factors for various use cases such as mobile vision and big-data scenarios. Here we explore ways to scale up networks in ways that aim at utilizing the added computation as efficiently as possible by suitably factorized convolutions and aggressive regularization. We benchmark our methods on the ILSVRC 2012 classification challenge validation set and demonstrate substantial gains over the state of the art: 21.2% top-1 and 5.6% top-5 error for single frame evaluation using a network with a computational cost of 5 billion multiply-adds per inference and using less than 25 million parameters. With an ensemble of 4 models and multi-crop evaluation, we report 3.5% top-5 error on the validation set (3.6% error on the test set) and 17.3% top-1 error on the validation set.
+
+
+

+
+
+## How to use it?
+
+
+
+**Predict image**
+
+```python
+from mmpretrain import inference_model
+
+predict = inference_model('inception-v3_3rdparty_8xb32_in1k', 'demo/bird.JPEG')
+print(predict['pred_class'])
+print(predict['pred_score'])
+```
+
+**Use the model**
+
+```python
+import torch
+from mmpretrain import get_model
+
+model = get_model('inception-v3_3rdparty_8xb32_in1k', pretrained=True)
+inputs = torch.rand(1, 3, 224, 224)
+out = model(inputs)
+print(type(out))
+# To extract features.
+feats = model.extract_feat(inputs)
+print(type(feats))
+```
+
+**Test Command**
+
+Prepare your dataset according to the [docs](https://mmpretrain.readthedocs.io/en/latest/user_guides/dataset_prepare.html#prepare-dataset).
+
+Test:
+
+```shell
+python tools/test.py configs/inception_v3/inception-v3_8xb32_in1k.py https://download.openmmlab.com/mmclassification/v0/inception-v3/inception-v3_3rdparty_8xb32_in1k_20220615-dcd4d910.pth
+```
+
+
+
+## Models and results
+
+### Image Classification on ImageNet-1k
+
+| Model | Pretrain | Params (M) | Flops (G) | Top-1 (%) | Top-5 (%) | Config | Download |
+| :----------------------------------- | :----------: | :--------: | :-------: | :-------: | :-------: | :----------------------------------: | :-----------------------------------------------------------------------------: |
+| `inception-v3_3rdparty_8xb32_in1k`\* | From scratch | 23.83 | 5.75 | 77.57 | 93.58 | [config](inception-v3_8xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/inception-v3/inception-v3_3rdparty_8xb32_in1k_20220615-dcd4d910.pth) |
+
+*Models with * are converted from the [official repo](https://github.com/pytorch/vision/blob/main/torchvision/models/inception.py#L28). The config files of these models are only for inference. We haven't reproduced the training results.*
+
+## Citation
+
+```bibtex
+@inproceedings{szegedy2016rethinking,
+ title={Rethinking the inception architecture for computer vision},
+ author={Szegedy, Christian and Vanhoucke, Vincent and Ioffe, Sergey and Shlens, Jon and Wojna, Zbigniew},
+ booktitle={Proceedings of the IEEE conference on computer vision and pattern recognition},
+ pages={2818--2826},
+ year={2016}
+}
+```
diff --git a/configs/inception_v3/inception-v3_8xb32_in1k.py b/configs/inception_v3/inception-v3_8xb32_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..ac977f4edbeca55afc3de118162b95cf47f7c15e
--- /dev/null
+++ b/configs/inception_v3/inception-v3_8xb32_in1k.py
@@ -0,0 +1,24 @@
+_base_ = [
+ '../_base_/models/inception_v3.py',
+ '../_base_/datasets/imagenet_bs32.py',
+ '../_base_/schedules/imagenet_bs256_coslr.py',
+ '../_base_/default_runtime.py',
+]
+
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='RandomResizedCrop', scale=299),
+ dict(type='RandomFlip', prob=0.5, direction='horizontal'),
+ dict(type='PackInputs'),
+]
+
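+# Inception-style evaluation: resize the short edge to 342 (about 299 / 0.875)
+# and then take a 299x299 center crop.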
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='ResizeEdge', scale=342, edge='short'),
+ dict(type='CenterCrop', crop_size=299),
+ dict(type='PackInputs'),
+]
+
+train_dataloader = dict(dataset=dict(pipeline=train_pipeline))
+val_dataloader = dict(dataset=dict(pipeline=test_pipeline))
+test_dataloader = dict(dataset=dict(pipeline=test_pipeline))
diff --git a/configs/inception_v3/metafile.yml b/configs/inception_v3/metafile.yml
new file mode 100644
index 0000000000000000000000000000000000000000..0b556deccf0d4ed4bc096d59338da061190ae62f
--- /dev/null
+++ b/configs/inception_v3/metafile.yml
@@ -0,0 +1,37 @@
+Collections:
+ - Name: Inception V3
+ Metadata:
+ Training Data: ImageNet-1k
+ Training Techniques:
+ - SGD with Momentum
+ - Weight Decay
+ Training Resources: 8x V100 GPUs
+ Epochs: 100
+ Batch Size: 256
+ Architecture:
+ - Inception
+ Paper:
+ URL: http://arxiv.org/abs/1512.00567
+ Title: "Rethinking the Inception Architecture for Computer Vision"
+ README: configs/inception_v3/README.md
+ Code:
+ URL: https://github.com/open-mmlab/mmpretrain/blob/v1.0.0rc1/configs/inception_v3/metafile.yml
+ Version: v1.0.0rc1
+
+Models:
+ - Name: inception-v3_3rdparty_8xb32_in1k
+ Metadata:
+ FLOPs: 5745177632
+ Parameters: 23834568
+ In Collection: Inception V3
+ Results:
+ - Task: Image Classification
+ Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 77.57
+ Top 5 Accuracy: 93.58
+ Weights: https://download.openmmlab.com/mmclassification/v0/inception-v3/inception-v3_3rdparty_8xb32_in1k_20220615-dcd4d910.pth
+ Config: configs/inception_v3/inception-v3_8xb32_in1k.py
+ Converted From:
+ Weights: https://download.pytorch.org/models/inception_v3_google-0cc3c7bd.pth
+ Code: https://github.com/pytorch/vision/blob/main/torchvision/models/inception.py#L28
diff --git a/configs/itpn/README.md b/configs/itpn/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..93200d0224b64158961f68f9c0fcea0e4fb1da59
--- /dev/null
+++ b/configs/itpn/README.md
@@ -0,0 +1,65 @@
+# iTPN
+
+> [Integrally Pre-Trained Transformer Pyramid Networks](https://arxiv.org/abs/2211.12735)
+
+
+
+## Abstract
+
+In this paper, we present an integral pre-training framework based on masked image modeling (MIM). We advocate for pre-training the backbone and neck jointly so that the transfer gap between MIM and downstream recognition tasks is minimal. We make two technical contributions. First, we unify the reconstruction and recognition necks by inserting a feature pyramid into the pre-training stage. Second, we complement masked image modeling (MIM) with masked feature modeling (MFM) that offers multi-stage supervision to the feature pyramid. The pre-trained models, termed integrally pre-trained transformer pyramid networks (iTPNs), serve as powerful foundation models for visual recognition. In particular, the base/large-level iTPN achieves an 86.2%/87.8% top-1 accuracy on ImageNet-1K, a 53.2%/55.6% box AP on COCO object detection with 1x training schedule using Mask-RCNN, and a 54.7%/57.7% mIoU on ADE20K semantic segmentation using UPerHead -- all these results set new records. Our work inspires the community to work on unifying upstream pre-training and downstream fine-tuning tasks. Code and the pre-trained models will be released at https://github.com/sunsmarterjie/iTPN.
+
+
+

+
+
+## How to use it?
+
+
+
+
+
+**Train/Test Command**
+
+Prepare your dataset according to the [docs](https://mmpretrain.readthedocs.io/en/latest/user_guides/dataset_prepare.html#prepare-dataset).
+
+Train:
+
+```shell
+python tools/train.py configs/itpn/itpn-pixel_hivit-base-p16_8xb512-amp-coslr-800e_in1k.py
+```
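+
+The `8xb512` in the config name indicates the intended setup of 8 GPUs with 512 samples per GPU. For a multi-GPU run, the repository's distributed launcher can be used instead (a minimal sketch, assuming 8 GPUs on a single node):
+
+```shell
+bash tools/dist_train.sh configs/itpn/itpn-pixel_hivit-base-p16_8xb512-amp-coslr-800e_in1k.py 8
+```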
+
+
+
+## Models and results
+
+### Pretrained models
+
+| Model | Params (M) | Flops (G) | Config | Download |
+| :------------------------------------------------------ | :--------: | :-------: | :----------------------------------------------------------------: | :------: |
+| `itpn-clip-b_hivit-base-p16_8xb256-amp-coslr-800e_in1k` | 233.00 | 18.47 | [config](itpn-clip-b_hivit-base-p16_8xb256-amp-coslr-800e_in1k.py) | N/A |
+| `itpn-pixel_hivit-base-p16_8xb512-amp-coslr-800e_in1k` | 103.00 | 18.47 | [config](itpn-pixel_hivit-base-p16_8xb512-amp-coslr-800e_in1k.py) | N/A |
+| `itpn-pixel_hivit-large-p16_8xb512-amp-coslr-800e_in1k` | 314.00 | 63.98 | [config](itpn-pixel_hivit-large-p16_8xb512-amp-coslr-800e_in1k.py) | N/A |
+
+## Citation
+
+```bibtex
+@article{tian2022integrally,
+ title={Integrally Pre-Trained Transformer Pyramid Networks},
+ author={Tian, Yunjie and Xie, Lingxi and Wang, Zhaozhi and Wei, Longhui and Zhang, Xiaopeng and Jiao, Jianbin and Wang, Yaowei and Tian, Qi and Ye, Qixiang},
+ journal={arXiv preprint arXiv:2211.12735},
+ year={2022}
+}
+```
diff --git a/configs/itpn/itpn-clip-b_hivit-base-p16_8xb256-amp-coslr-300e_in1k.py b/configs/itpn/itpn-clip-b_hivit-base-p16_8xb256-amp-coslr-300e_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..40f35d9486e7b532dfd4904d94d379167222b62f
--- /dev/null
+++ b/configs/itpn/itpn-clip-b_hivit-base-p16_8xb256-amp-coslr-300e_in1k.py
@@ -0,0 +1,84 @@
+_base_ = [
+ '../_base_/datasets/imagenet_bs256_itpn.py',
+ '../_base_/default_runtime.py',
+]
+
+model = dict(
+ type='iTPN',
+ backbone=dict(
+ type='iTPNHiViT',
+ arch='base',
+ drop_path_rate=0.0,
+ rpe=True,
+ layer_scale_init_value=0.1,
+ reconstruction_type='clip'),
+ neck=dict(
+ type='iTPNPretrainDecoder',
+ patch_size=16,
+ in_chans=3,
+ embed_dim=512,
+ mlp_ratio=4.,
+ reconstruction_type='clip',
+ # transformer pyramid
+ fpn_dim=256,
+ fpn_depth=2,
+ num_outs=3,
+ ),
+ head=dict(
+ type='iTPNClipHead',
+ embed_dims=512,
+ num_embed=512,
+ loss=dict(type='CosineSimilarityLoss')),
+ target_generator=dict(
+ type='CLIPGenerator',
+ tokenizer_path= # noqa
+ 'https://download.openmmlab.com/mmselfsup/1.x/target_generator_ckpt/clip_vit_base_16.pth.tar' # noqa
+ ),
+)
+
+# optimizer wrapper
+optim_wrapper = dict(
+ type='AmpOptimWrapper',
+ loss_scale='dynamic',
+ # betas: (0.9, 0.98) for 300 epochs and (0.9, 0.999) for 1600 epochs.
+ optimizer=dict(
+ type='AdamW', lr=1.5e-3, betas=(0.9, 0.98), weight_decay=0.05),
+ clip_grad=dict(max_norm=3.0),
+ paramwise_cfg=dict(
+ custom_keys={
+ '.norm': dict(decay_mult=0.0),
+ '.pos_embed': dict(decay_mult=0.0),
+ '.gamma': dict(decay_mult=0.0),
+ }))
+
+# learning rate scheduler
+param_scheduler = [
+ dict(
+ type='LinearLR',
+ start_factor=1e-4,
+ by_epoch=True,
+ begin=0,
+ end=10,
+ convert_to_iter_based=True),
+ dict(
+ type='CosineAnnealingLR',
+ eta_min=1e-5,
+ by_epoch=True,
+ begin=10,
+ end=300,
+ convert_to_iter_based=True)
+]
+
+# runtime settings
+train_cfg = dict(type='EpochBasedTrainLoop', max_epochs=300)
+default_hooks = dict(
+ # only keeps the latest 3 checkpoints
+ checkpoint=dict(type='CheckpointHook', interval=1, max_keep_ckpts=3))
+
+randomness = dict(seed=0, diff_rank_seed=True)
+
+find_unused_parameters = True
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR
+# based on the actual training batch size.
+auto_scale_lr = dict(base_batch_size=2048)
diff --git a/configs/itpn/itpn-clip-b_hivit-base-p16_8xb256-amp-coslr-800e_in1k.py b/configs/itpn/itpn-clip-b_hivit-base-p16_8xb256-amp-coslr-800e_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..2c624e7302924ea544ff2e347966956c4652e4f5
--- /dev/null
+++ b/configs/itpn/itpn-clip-b_hivit-base-p16_8xb256-amp-coslr-800e_in1k.py
@@ -0,0 +1,84 @@
+_base_ = [
+ '../_base_/datasets/imagenet_bs256_itpn.py',
+ '../_base_/default_runtime.py',
+]
+
+model = dict(
+ type='iTPN',
+ backbone=dict(
+ type='iTPNHiViT',
+ arch='base',
+ drop_path_rate=0.1,
+ rpe=True,
+ layer_scale_init_value=0.1,
+ reconstruction_type='clip'),
+ neck=dict(
+ type='iTPNPretrainDecoder',
+ patch_size=16,
+ in_chans=3,
+ embed_dim=512,
+ mlp_ratio=4.,
+ reconstruction_type='clip',
+ # transformer pyramid
+ fpn_dim=256,
+ fpn_depth=2,
+ num_outs=3,
+ ),
+ head=dict(
+ type='iTPNClipHead',
+ embed_dims=512,
+ num_embed=512,
+ loss=dict(type='CrossEntropyLoss')),
+ target_generator=dict(
+ type='CLIPGenerator',
+ tokenizer_path= # noqa
+ 'https://download.openmmlab.com/mmselfsup/1.x/target_generator_ckpt/clip_vit_base_16.pth.tar' # noqa
+ ),
+)
+
+# optimizer wrapper
+optim_wrapper = dict(
+ type='AmpOptimWrapper',
+ loss_scale='dynamic',
+ # betas: (0.9, 0.98) for 300 epochs and (0.9, 0.999) for 800/1600 epochs.
+ optimizer=dict(
+ type='AdamW', lr=1.5e-3, betas=(0.9, 0.999), weight_decay=0.05),
+ clip_grad=dict(max_norm=3.0),
+ paramwise_cfg=dict(
+ custom_keys={
+ '.norm': dict(decay_mult=0.0),
+ '.pos_embed': dict(decay_mult=0.0),
+ '.gamma': dict(decay_mult=0.0),
+ }))
+
+# learning rate scheduler
+param_scheduler = [
+ dict(
+ type='LinearLR',
+ start_factor=1e-4,
+ by_epoch=True,
+ begin=0,
+ end=10,
+ convert_to_iter_based=True),
+ dict(
+ type='CosineAnnealingLR',
+ eta_min=1e-5,
+ by_epoch=True,
+ begin=10,
+ end=800,
+ convert_to_iter_based=True)
+]
+
+# runtime settings
+train_cfg = dict(type='EpochBasedTrainLoop', max_epochs=800)
+default_hooks = dict(
+ # only keeps the latest 3 checkpoints
+ checkpoint=dict(type='CheckpointHook', interval=1, max_keep_ckpts=3))
+
+randomness = dict(seed=0, diff_rank_seed=True)
+
+find_unused_parameters = True
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR
+# based on the actual training batch size.
+auto_scale_lr = dict(base_batch_size=2048)
diff --git a/configs/itpn/itpn-pixel_hivit-base-p16_8xb512-amp-coslr-1600e_in1k.py b/configs/itpn/itpn-pixel_hivit-base-p16_8xb512-amp-coslr-1600e_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..d324a448fae9edd36fdcfa48c65829fa24a1be51
--- /dev/null
+++ b/configs/itpn/itpn-pixel_hivit-base-p16_8xb512-amp-coslr-1600e_in1k.py
@@ -0,0 +1,56 @@
+_base_ = [
+ '../_base_/models/itpn_hivit-base-p16.py',
+ '../_base_/datasets/imagenet_bs512_mae.py',
+ '../_base_/default_runtime.py',
+]
+
+# optimizer wrapper
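+# The base LR follows the MAE-style linear scaling rule:
+# 1.5e-4 per 256 samples, scaled to the reference batch size of 4096.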
+optim_wrapper = dict(
+ type='AmpOptimWrapper',
+ loss_scale='dynamic',
+ optimizer=dict(
+ type='AdamW',
+ lr=1.5e-4 * 4096 / 256,
+ betas=(0.9, 0.95),
+ weight_decay=0.05),
+ paramwise_cfg=dict(
+ custom_keys={
+ 'norm': dict(decay_mult=0.0),
+ 'bias': dict(decay_mult=0.0),
+ 'pos_embed': dict(decay_mult=0.),
+ 'mask_token': dict(decay_mult=0.),
+ }))
+
+# learning rate scheduler
+param_scheduler = [
+ dict(
+ type='LinearLR',
+ start_factor=1e-4,
+ by_epoch=True,
+ begin=0,
+ end=40,
+ convert_to_iter_based=True),
+ dict(
+ type='CosineAnnealingLR',
+ T_max=1560,
+ by_epoch=True,
+ begin=40,
+ end=1600,
+ convert_to_iter_based=True)
+]
+
+# runtime settings
+train_cfg = dict(type='EpochBasedTrainLoop', max_epochs=1600)
+default_hooks = dict(
+ # only keeps the latest 3 checkpoints
+ checkpoint=dict(type='CheckpointHook', interval=1, max_keep_ckpts=3))
+
+randomness = dict(seed=0, diff_rank_seed=True)
+
+# auto resume
+resume = True
+find_unused_parameters = True
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR
+# based on the actual training batch size.
+auto_scale_lr = dict(base_batch_size=4096)
diff --git a/configs/itpn/itpn-pixel_hivit-base-p16_8xb512-amp-coslr-400e_in1k.py b/configs/itpn/itpn-pixel_hivit-base-p16_8xb512-amp-coslr-400e_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..c489dda9321774829fd5bf6e56de65603e177c6a
--- /dev/null
+++ b/configs/itpn/itpn-pixel_hivit-base-p16_8xb512-amp-coslr-400e_in1k.py
@@ -0,0 +1,56 @@
+_base_ = [
+ '../_base_/models/itpn_hivit-base-p16.py',
+ '../_base_/datasets/imagenet_bs512_mae.py',
+ '../_base_/default_runtime.py',
+]
+
+# optimizer wrapper
+optim_wrapper = dict(
+ type='AmpOptimWrapper',
+ loss_scale='dynamic',
+ optimizer=dict(
+ type='AdamW',
+ lr=1.5e-4 * 4096 / 256,
+ betas=(0.9, 0.95),
+ weight_decay=0.05),
+ paramwise_cfg=dict(
+ custom_keys={
+ 'norm': dict(decay_mult=0.0),
+ 'bias': dict(decay_mult=0.0),
+ 'pos_embed': dict(decay_mult=0.),
+ 'mask_token': dict(decay_mult=0.),
+ }))
+
+# learning rate scheduler
+param_scheduler = [
+ dict(
+ type='LinearLR',
+ start_factor=1e-4,
+ by_epoch=True,
+ begin=0,
+ end=40,
+ convert_to_iter_based=True),
+ dict(
+ type='CosineAnnealingLR',
+ T_max=360,
+ by_epoch=True,
+ begin=40,
+ end=400,
+ convert_to_iter_based=True)
+]
+
+# runtime settings
+train_cfg = dict(type='EpochBasedTrainLoop', max_epochs=400)
+default_hooks = dict(
+ # only keeps the latest 3 checkpoints
+ checkpoint=dict(type='CheckpointHook', interval=1, max_keep_ckpts=3))
+
+randomness = dict(seed=0, diff_rank_seed=True)
+
+# auto resume
+resume = True
+find_unused_parameters = True
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR
+# based on the actual training batch size.
+auto_scale_lr = dict(base_batch_size=4096)
diff --git a/configs/itpn/itpn-pixel_hivit-base-p16_8xb512-amp-coslr-800e_in1k.py b/configs/itpn/itpn-pixel_hivit-base-p16_8xb512-amp-coslr-800e_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..ebc5be011a816d23fb0d6ce801d43fd8f4019ae7
--- /dev/null
+++ b/configs/itpn/itpn-pixel_hivit-base-p16_8xb512-amp-coslr-800e_in1k.py
@@ -0,0 +1,56 @@
+_base_ = [
+ '../_base_/models/itpn_hivit-base-p16.py',
+ '../_base_/datasets/imagenet_bs512_mae.py',
+ '../_base_/default_runtime.py',
+]
+
+# optimizer wrapper
+optim_wrapper = dict(
+ type='AmpOptimWrapper',
+ loss_scale='dynamic',
+ optimizer=dict(
+ type='AdamW',
+ lr=1.5e-4 * 4096 / 256,
+ betas=(0.9, 0.95),
+ weight_decay=0.05),
+ paramwise_cfg=dict(
+ custom_keys={
+ 'norm': dict(decay_mult=0.0),
+ 'bias': dict(decay_mult=0.0),
+ 'pos_embed': dict(decay_mult=0.),
+ 'mask_token': dict(decay_mult=0.),
+ }))
+
+# learning rate scheduler
+param_scheduler = [
+ dict(
+ type='LinearLR',
+ start_factor=1e-4,
+ by_epoch=True,
+ begin=0,
+ end=40,
+ convert_to_iter_based=True),
+ dict(
+ type='CosineAnnealingLR',
+ T_max=760,
+ by_epoch=True,
+ begin=40,
+ end=800,
+ convert_to_iter_based=True)
+]
+
+# runtime settings
+train_cfg = dict(type='EpochBasedTrainLoop', max_epochs=800)
+default_hooks = dict(
+ # only keeps the latest 3 checkpoints
+ checkpoint=dict(type='CheckpointHook', interval=1, max_keep_ckpts=3))
+
+randomness = dict(seed=0, diff_rank_seed=True)
+
+# auto resume
+resume = True
+find_unused_parameters = True
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR
+# based on the actual training batch size.
+auto_scale_lr = dict(base_batch_size=4096)
diff --git a/configs/itpn/itpn-pixel_hivit-large-p16_8xb512-amp-coslr-1600e_in1k.py b/configs/itpn/itpn-pixel_hivit-large-p16_8xb512-amp-coslr-1600e_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..359191bc84599016e33b7228a136a06db832b9ea
--- /dev/null
+++ b/configs/itpn/itpn-pixel_hivit-large-p16_8xb512-amp-coslr-1600e_in1k.py
@@ -0,0 +1,61 @@
+_base_ = [
+ '../_base_/models/itpn_hivit-base-p16.py',
+ '../_base_/datasets/imagenet_bs512_mae.py',
+ '../_base_/default_runtime.py',
+]
+
+# model settings
+model = dict(
+ backbone=dict(type='iTPNHiViT', arch='large'),
+ neck=dict(type='iTPNPretrainDecoder', embed_dim=768))
+
+# optimizer wrapper
+optim_wrapper = dict(
+ type='AmpOptimWrapper',
+ loss_scale='dynamic',
+ optimizer=dict(
+ type='AdamW',
+ lr=1.5e-4 * 4096 / 256,
+ betas=(0.9, 0.95),
+ weight_decay=0.05),
+ paramwise_cfg=dict(
+ custom_keys={
+ 'ln': dict(decay_mult=0.0),
+ 'bias': dict(decay_mult=0.0),
+ 'pos_embed': dict(decay_mult=0.),
+ 'mask_token': dict(decay_mult=0.),
+ }))
+
+# learning rate scheduler
+param_scheduler = [
+ dict(
+ type='LinearLR',
+ start_factor=1e-4,
+ by_epoch=True,
+ begin=0,
+ end=40,
+ convert_to_iter_based=True),
+ dict(
+ type='CosineAnnealingLR',
+ T_max=1560,
+ by_epoch=True,
+ begin=40,
+ end=1600,
+ convert_to_iter_based=True)
+]
+
+# runtime settings
+train_cfg = dict(type='EpochBasedTrainLoop', max_epochs=1600)
+default_hooks = dict(
+ # only keeps the latest 3 checkpoints
+ checkpoint=dict(type='CheckpointHook', interval=1, max_keep_ckpts=3))
+
+randomness = dict(seed=0, diff_rank_seed=True)
+
+# auto resume
+resume = True
+find_unused_parameters = True
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR
+# based on the actual training batch size.
+auto_scale_lr = dict(base_batch_size=4096)
diff --git a/configs/itpn/itpn-pixel_hivit-large-p16_8xb512-amp-coslr-400e_in1k.py b/configs/itpn/itpn-pixel_hivit-large-p16_8xb512-amp-coslr-400e_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..ca4ba00b23789e1b31e57bb6d1078498a9375f7a
--- /dev/null
+++ b/configs/itpn/itpn-pixel_hivit-large-p16_8xb512-amp-coslr-400e_in1k.py
@@ -0,0 +1,61 @@
+_base_ = [
+ '../_base_/models/itpn_hivit-base-p16.py',
+ '../_base_/datasets/imagenet_bs512_mae.py',
+ '../_base_/default_runtime.py',
+]
+
+# model settings
+model = dict(
+ backbone=dict(type='iTPNHiViT', arch='large'),
+ neck=dict(type='iTPNPretrainDecoder', embed_dim=768))
+
+# optimizer wrapper
+optim_wrapper = dict(
+ type='AmpOptimWrapper',
+ loss_scale='dynamic',
+ optimizer=dict(
+ type='AdamW',
+ lr=1.5e-4 * 4096 / 256,
+ betas=(0.9, 0.95),
+ weight_decay=0.05),
+ paramwise_cfg=dict(
+ custom_keys={
+ 'ln': dict(decay_mult=0.0),
+ 'bias': dict(decay_mult=0.0),
+ 'pos_embed': dict(decay_mult=0.),
+ 'mask_token': dict(decay_mult=0.),
+ }))
+
+# learning rate scheduler
+param_scheduler = [
+ dict(
+ type='LinearLR',
+ start_factor=1e-4,
+ by_epoch=True,
+ begin=0,
+ end=40,
+ convert_to_iter_based=True),
+ dict(
+ type='CosineAnnealingLR',
+ T_max=360,
+ by_epoch=True,
+ begin=40,
+ end=400,
+ convert_to_iter_based=True)
+]
+
+# runtime settings
+train_cfg = dict(type='EpochBasedTrainLoop', max_epochs=400)
+default_hooks = dict(
+ # only keeps the latest 3 checkpoints
+ checkpoint=dict(type='CheckpointHook', interval=1, max_keep_ckpts=3))
+
+randomness = dict(seed=0, diff_rank_seed=True)
+
+# auto resume
+resume = True
+find_unused_parameters = True
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR
+# based on the actual training batch size.
+auto_scale_lr = dict(base_batch_size=4096)
diff --git a/configs/itpn/itpn-pixel_hivit-large-p16_8xb512-amp-coslr-800e_in1k.py b/configs/itpn/itpn-pixel_hivit-large-p16_8xb512-amp-coslr-800e_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..b1e298b0b97db3c4391dcda5adac4e01438fdfc9
--- /dev/null
+++ b/configs/itpn/itpn-pixel_hivit-large-p16_8xb512-amp-coslr-800e_in1k.py
@@ -0,0 +1,61 @@
+_base_ = [
+ '../_base_/models/itpn_hivit-base-p16.py',
+ '../_base_/datasets/imagenet_bs512_mae.py',
+ '../_base_/default_runtime.py',
+]
+
+# model settings
+model = dict(
+ backbone=dict(type='iTPNHiViT', arch='large'),
+ neck=dict(type='iTPNPretrainDecoder', embed_dim=768))
+
+# optimizer wrapper
+optim_wrapper = dict(
+ type='AmpOptimWrapper',
+ loss_scale='dynamic',
+ optimizer=dict(
+ type='AdamW',
+ lr=1.5e-4 * 4096 / 256,
+ betas=(0.9, 0.95),
+ weight_decay=0.05),
+ paramwise_cfg=dict(
+ custom_keys={
+ 'ln': dict(decay_mult=0.0),
+ 'bias': dict(decay_mult=0.0),
+ 'pos_embed': dict(decay_mult=0.),
+ 'mask_token': dict(decay_mult=0.),
+ }))
+
+# learning rate scheduler
+param_scheduler = [
+ dict(
+ type='LinearLR',
+ start_factor=1e-4,
+ by_epoch=True,
+ begin=0,
+ end=40,
+ convert_to_iter_based=True),
+ dict(
+ type='CosineAnnealingLR',
+ T_max=760,
+ by_epoch=True,
+ begin=40,
+ end=800,
+ convert_to_iter_based=True)
+]
+
+# runtime settings
+train_cfg = dict(type='EpochBasedTrainLoop', max_epochs=800)
+default_hooks = dict(
+ # only keeps the latest 3 checkpoints
+ checkpoint=dict(type='CheckpointHook', interval=1, max_keep_ckpts=3))
+
+randomness = dict(seed=0, diff_rank_seed=True)
+
+# auto resume
+resume = True
+find_unused_parameters = True
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR
+# based on the actual training batch size.
+auto_scale_lr = dict(base_batch_size=4096)
diff --git a/configs/itpn/metafile.yml b/configs/itpn/metafile.yml
new file mode 100644
index 0000000000000000000000000000000000000000..b8f5844de10f3df4114ba9eb655ed5baf844cb0e
--- /dev/null
+++ b/configs/itpn/metafile.yml
@@ -0,0 +1,50 @@
+Collections:
+ - Name: iTPN
+ Metadata:
+ Architecture:
+ - Dense Connections
+ - GELU
+ - Layer Normalization
+ - Multi-Head Attention
+ - Scaled Dot-Product Attention
+ Paper:
+ Title: 'Integrally Pre-Trained Transformer Pyramid Networks'
+ URL: https://arxiv.org/abs/2211.12735
+ README: configs/itpn/README.md
+ Code:
+ URL: null
+ Version: null
+
+Models:
+ - Name: itpn-clip-b_hivit-base-p16_8xb256-amp-coslr-800e_in1k
+ Metadata:
+ FLOPs: 18474000000
+ Parameters: 233000000
+ Training Data:
+ - ImageNet-1k
+ In Collection: iTPN
+ Results: null
+ Weights:
+ Config: configs/itpn/itpn-clip-b_hivit-base-p16_8xb256-amp-coslr-800e_in1k.py
+
+ - Name: itpn-pixel_hivit-base-p16_8xb512-amp-coslr-800e_in1k
+ Metadata:
+ FLOPs: 18474000000
+ Parameters: 103000000
+ Training Data:
+ - ImageNet-1k
+ In Collection: iTPN
+ Results: null
+ Weights:
+ Config: configs/itpn/itpn-pixel_hivit-base-p16_8xb512-amp-coslr-800e_in1k.py
+
+ - Name: itpn-pixel_hivit-large-p16_8xb512-amp-coslr-800e_in1k
+ Metadata:
+ FLOPs: 63977000000
+ Parameters: 314000000
+ Training Data:
+ - ImageNet-1k
+ In Collection: iTPN
+ Results: null
+ Weights:
+ Config: configs/itpn/itpn-pixel_hivit-large-p16_8xb512-amp-coslr-800e_in1k.py
diff --git a/configs/lenet/README.md b/configs/lenet/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..2cd68eac42ed7fa1d0167fe1f7b9ad917e5ce735
--- /dev/null
+++ b/configs/lenet/README.md
@@ -0,0 +1,28 @@
+# LeNet
+
+> [Backpropagation Applied to Handwritten Zip Code Recognition](https://ieeexplore.ieee.org/document/6795724)
+
+
+
+## Abstract
+
+The ability of learning networks to generalize can be greatly enhanced by providing constraints from the task domain. This paper demonstrates how such constraints can be integrated into a backpropagation network through the architecture of the network. This approach has been successfully applied to the recognition of handwritten zip code digits provided by the U.S. Postal Service. A single network learns the entire recognition operation, going from the normalized image of the character to the final classification.
+
+
+

+
+
+## Citation
+
+```bibtex
+@ARTICLE{6795724,
+ author={Y. {LeCun} and B. {Boser} and J. S. {Denker} and D. {Henderson} and R. E. {Howard} and W. {Hubbard} and L. D. {Jackel}},
+ journal={Neural Computation},
+ title={Backpropagation Applied to Handwritten Zip Code Recognition},
+ year={1989},
+ volume={1},
+ number={4},
+ pages={541-551},
+ doi={10.1162/neco.1989.1.4.541}
+}
+```
diff --git a/configs/lenet/lenet5_mnist.py b/configs/lenet/lenet5_mnist.py
new file mode 100644
index 0000000000000000000000000000000000000000..0ae8192548626c0073228a827d6b6b6595730a5e
--- /dev/null
+++ b/configs/lenet/lenet5_mnist.py
@@ -0,0 +1,89 @@
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(type='LeNet5', num_classes=10),
+ neck=None,
+ head=dict(
+ type='ClsHead',
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ ))
+
+# dataset settings
+dataset_type = 'MNIST'
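+# mean/std are on the 0-255 pixel scale; MNIST is single-channel, so only one
+# value is needed for each.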
+data_preprocessor = dict(mean=[33.46], std=[78.87], num_classes=10)
+
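+# LeNet-5 expects 32x32 inputs, so the 28x28 MNIST images are resized first.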
+pipeline = [dict(type='Resize', scale=32), dict(type='PackInputs')]
+
+common_data_cfg = dict(
+ type=dataset_type, data_prefix='data/mnist', pipeline=pipeline)
+
+train_dataloader = dict(
+ batch_size=128,
+ num_workers=2,
+ dataset=dict(**common_data_cfg, test_mode=False),
+ sampler=dict(type='DefaultSampler', shuffle=True),
+)
+
+val_dataloader = dict(
+ batch_size=128,
+ num_workers=2,
+ dataset=dict(**common_data_cfg, test_mode=True),
+ sampler=dict(type='DefaultSampler', shuffle=False),
+)
+val_evaluator = dict(type='Accuracy', topk=(1, ))
+
+test_dataloader = val_dataloader
+test_evaluator = val_evaluator
+
+# schedule settings
+optim_wrapper = dict(
+ optimizer=dict(type='SGD', lr=0.01, momentum=0.9, weight_decay=0.0001))
+
+param_scheduler = dict(
+ type='MultiStepLR', # learning policy, decay on several milestones.
+ by_epoch=True, # update based on epoch.
+ milestones=[15], # decay at the 15th epoch.
+ gamma=0.1, # decay to 0.1 times.
+)
+
+train_cfg = dict(by_epoch=True, max_epochs=5, val_interval=1) # train 5 epochs
+val_cfg = dict()
+test_cfg = dict()
+
+# runtime settings
+default_scope = 'mmpretrain'
+
+default_hooks = dict(
+ # record the time of every iteration.
+ timer=dict(type='IterTimerHook'),
+ # print log every 150 iterations.
+ logger=dict(type='LoggerHook', interval=150),
+ # enable the parameter scheduler.
+ param_scheduler=dict(type='ParamSchedulerHook'),
+ # save checkpoint per epoch.
+ checkpoint=dict(type='CheckpointHook', interval=1),
+ # set sampler seed in distributed environment.
+ sampler_seed=dict(type='DistSamplerSeedHook'),
+)
+
+env_cfg = dict(
+ # disable cudnn benchmark
+ cudnn_benchmark=False,
+ # set multi process parameters
+ mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
+ # set distributed parameters
+ dist_cfg=dict(backend='nccl'),
+)
+
+log_level = 'INFO'
+
+# load from which checkpoint
+load_from = None
+
+# whether to resume the training of the checkpoint
+resume_from = None
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR
+# based on the actual training batch size.
+# base_batch_size = (1 GPUs) x (128 samples per GPU)
+auto_scale_lr = dict(base_batch_size=128)
diff --git a/configs/levit/README.md b/configs/levit/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..234edb60618b3edd61cb01c0c172513011b1b042
--- /dev/null
+++ b/configs/levit/README.md
@@ -0,0 +1,81 @@
+# LeViT
+
+> [LeViT: a Vision Transformer in ConvNet’s Clothing for Faster Inference](https://arxiv.org/abs/2104.01136)
+
+
+
+## Abstract
+
+We design a family of image classification architectures that optimize the trade-off between accuracy and efficiency in a high-speed regime. Our work exploits recent findings in attention-based architectures, which are competitive on highly parallel processing hardware. We revisit principles from the extensive literature on convolutional neural networks to apply them to transformers, in particular activation maps with decreasing resolutions. We also introduce the attention bias, a new way to integrate positional information in vision transformers. As a result, we propose LeViT: a hybrid neural network for fast inference image classification. We consider different measures of efficiency on different hardware platforms, so as to best reflect a wide range of application scenarios. Our extensive experiments empirically validate our technical choices and show they are suitable to most architectures. Overall, LeViT significantly outperforms existing convnets and vision transformers with respect to the speed/accuracy tradeoff. For example, at 80% ImageNet top-1 accuracy, LeViT is 5 times faster than EfficientNet on CPU.
+
+
+

+
+
+## How to use it?
+
+
+
+**Predict image**
+
+```python
+from mmpretrain import inference_model
+
+predict = inference_model('levit-128s_3rdparty_in1k', 'demo/bird.JPEG')
+print(predict['pred_class'])
+print(predict['pred_score'])
+```
+
+**Use the model**
+
+```python
+import torch
+from mmpretrain import get_model
+
+model = get_model('levit-128s_3rdparty_in1k', pretrained=True)
+inputs = torch.rand(1, 3, 224, 224)
+out = model(inputs)
+print(type(out))
+# To extract features.
+feats = model.extract_feat(inputs)
+print(type(feats))
+```
+
+**Test Command**
+
+Prepare your dataset according to the [docs](https://mmpretrain.readthedocs.io/en/latest/user_guides/dataset_prepare.html#prepare-dataset).
+
+Test:
+
+```shell
+python tools/test.py configs/levit/levit-128s_8xb256_in1k.py https://download.openmmlab.com/mmclassification/v0/levit/levit-128s_3rdparty_in1k_20230117-e9fbd209.pth
+```
+
+
+
+## Models and results
+
+### Image Classification on ImageNet-1k
+
+| Model | Pretrain | Params (M) | Flops (G) | Top-1 (%) | Top-5 (%) | Config | Download |
+| :--------------------------- | :----------: | :--------: | :-------: | :-------: | :-------: | :---------------------------------: | :--------------------------------------------------------------------------------------: |
+| `levit-128s_3rdparty_in1k`\* | From scratch | 7.39 | 0.31 | 76.51 | 92.90 | [config](levit-128s_8xb256_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/levit/levit-128s_3rdparty_in1k_20230117-e9fbd209.pth) |
+| `levit-128_3rdparty_in1k`\* | From scratch | 8.83 | 0.41 | 78.58 | 93.95 | [config](levit-128_8xb256_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/levit/levit-128_3rdparty_in1k_20230117-3be02a02.pth) |
+| `levit-192_3rdparty_in1k`\* | From scratch | 10.56 | 0.67 | 79.86 | 94.75 | [config](levit-192_8xb256_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/levit/levit-192_3rdparty_in1k_20230117-8217a0f9.pth) |
+| `levit-256_3rdparty_in1k`\* | From scratch | 18.38 | 1.14 | 81.59 | 95.46 | [config](levit-256_8xb256_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/levit/levit-256_3rdparty_in1k_20230117-5ae2ce7d.pth) |
+| `levit-384_3rdparty_in1k`\* | From scratch | 38.36 | 2.37 | 82.59 | 95.95 | [config](levit-384_8xb256_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/levit/levit-384_3rdparty_in1k_20230117-f3539cce.pth) |
+
+*Models with * are converted from the [official repo](https://github.com/facebookresearch/LeViT). The config files of these models are only for inference. We haven't reproduced the training results.*
+
+## Citation
+
+```bibtex
+@InProceedings{Graham_2021_ICCV,
+ author = {Graham, Benjamin and El-Nouby, Alaaeldin and Touvron, Hugo and Stock, Pierre and Joulin, Armand and Jegou, Herve and Douze, Matthijs},
+ title = {LeViT: A Vision Transformer in ConvNet's Clothing for Faster Inference},
+ booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
+ month = {October},
+ year = {2021},
+ pages = {12259-12269}
+}
+```
diff --git a/configs/levit/deploy/levit-128_8xb256_in1k.py b/configs/levit/deploy/levit-128_8xb256_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..ab58119395339cc11a4cb09caad1ea0cb6c7ae3b
--- /dev/null
+++ b/configs/levit/deploy/levit-128_8xb256_in1k.py
@@ -0,0 +1,3 @@
+_base_ = '../levit-128_8xb256_in1k.py'
+
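+# `deploy=True` switches the backbone and head to their fused inference-time
+# form, so this config is intended for evaluating re-parameterized weights
+# rather than for training.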
+model = dict(backbone=dict(deploy=True), head=dict(deploy=True))
diff --git a/configs/levit/deploy/levit-128s_8xb256_in1k.py b/configs/levit/deploy/levit-128s_8xb256_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..93ebc3724714b73b362bc12de1b9029040cbc4f6
--- /dev/null
+++ b/configs/levit/deploy/levit-128s_8xb256_in1k.py
@@ -0,0 +1,3 @@
+_base_ = '../levit-128s_8xb256_in1k.py'
+
+model = dict(backbone=dict(deploy=True), head=dict(deploy=True))
diff --git a/configs/levit/deploy/levit-192_8xb256_in1k.py b/configs/levit/deploy/levit-192_8xb256_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..34249fda74d97b4f1e591cd39722b9cbdd94d3d2
--- /dev/null
+++ b/configs/levit/deploy/levit-192_8xb256_in1k.py
@@ -0,0 +1,3 @@
+_base_ = '../levit-192_8xb256_in1k.py'
+
+model = dict(backbone=dict(deploy=True), head=dict(deploy=True))
diff --git a/configs/levit/deploy/levit-256_8xb256_in1k.py b/configs/levit/deploy/levit-256_8xb256_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..687f83506e30fcf36041729b70b30822b30cae81
--- /dev/null
+++ b/configs/levit/deploy/levit-256_8xb256_in1k.py
@@ -0,0 +1,3 @@
+_base_ = '../levit-256_8xb256_in1k.py'
+
+model = dict(backbone=dict(deploy=True), head=dict(deploy=True))
diff --git a/configs/levit/deploy/levit-384_8xb256_in1k.py b/configs/levit/deploy/levit-384_8xb256_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..9a83d47a54507022389bfb34c50ae466c978586b
--- /dev/null
+++ b/configs/levit/deploy/levit-384_8xb256_in1k.py
@@ -0,0 +1,3 @@
+_base_ = '../levit-384_8xb256_in1k.py'
+
+model = dict(backbone=dict(deploy=True), head=dict(deploy=True))
diff --git a/configs/levit/levit-128_8xb256_in1k.py b/configs/levit/levit-128_8xb256_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..cdec48e3ffbb317ae464be244bf8e05cf4c41165
--- /dev/null
+++ b/configs/levit/levit-128_8xb256_in1k.py
@@ -0,0 +1,12 @@
+_base_ = [
+ '../_base_/models/levit-256-p16.py',
+ '../_base_/datasets/imagenet_bs64_swin_224.py',
+ '../_base_/schedules/imagenet_bs2048_adamw_levit.py',
+ '../_base_/default_runtime.py',
+]
+
+# model settings
+model = dict(backbone=dict(arch='128'), head=dict(in_channels=384))
+
+# dataset settings
+train_dataloader = dict(batch_size=256)
diff --git a/configs/levit/levit-128s_8xb256_in1k.py b/configs/levit/levit-128s_8xb256_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..0564cac7e018ec4e311f5e970e9211260ada402c
--- /dev/null
+++ b/configs/levit/levit-128s_8xb256_in1k.py
@@ -0,0 +1,12 @@
+_base_ = [
+ '../_base_/models/levit-256-p16.py',
+ '../_base_/datasets/imagenet_bs64_swin_224.py',
+ '../_base_/schedules/imagenet_bs2048_adamw_levit.py',
+ '../_base_/default_runtime.py',
+]
+
+# model settings
+model = dict(backbone=dict(arch='128s'), head=dict(in_channels=384))
+
+# dataset settings
+train_dataloader = dict(batch_size=256)
diff --git a/configs/levit/levit-192_8xb256_in1k.py b/configs/levit/levit-192_8xb256_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..dfbf70e0ad2f0a35e4acca090bd6d2cadd6932f0
--- /dev/null
+++ b/configs/levit/levit-192_8xb256_in1k.py
@@ -0,0 +1,12 @@
+_base_ = [
+ '../_base_/models/levit-256-p16.py',
+ '../_base_/datasets/imagenet_bs64_swin_224.py',
+ '../_base_/schedules/imagenet_bs2048_adamw_levit.py',
+ '../_base_/default_runtime.py',
+]
+
+# model settings
+model = dict(backbone=dict(arch='192'), head=dict(in_channels=384))
+
+# dataset settings
+train_dataloader = dict(batch_size=256)
diff --git a/configs/levit/levit-256_8xb256_in1k.py b/configs/levit/levit-256_8xb256_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..e961e776faf923f7acceef8b2578f86e7f630afa
--- /dev/null
+++ b/configs/levit/levit-256_8xb256_in1k.py
@@ -0,0 +1,9 @@
+_base_ = [
+ '../_base_/models/levit-256-p16.py',
+ '../_base_/datasets/imagenet_bs64_swin_224.py',
+ '../_base_/schedules/imagenet_bs2048_adamw_levit.py',
+ '../_base_/default_runtime.py',
+]
+
+# dataset settings
+train_dataloader = dict(batch_size=256)
diff --git a/configs/levit/levit-384_8xb256_in1k.py b/configs/levit/levit-384_8xb256_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..10ceac45c4cc907165d75c6b1b320c07f9a384e9
--- /dev/null
+++ b/configs/levit/levit-384_8xb256_in1k.py
@@ -0,0 +1,15 @@
+_base_ = [
+ '../_base_/models/levit-256-p16.py',
+ '../_base_/datasets/imagenet_bs64_swin_224.py',
+ '../_base_/schedules/imagenet_bs2048_adamw_levit.py',
+ '../_base_/default_runtime.py',
+]
+
+# model settings
+model = dict(
+ backbone=dict(arch='384', drop_path_rate=0.1),
+ head=dict(in_channels=768),
+)
+
+# dataset settings
+train_dataloader = dict(batch_size=256)
diff --git a/configs/levit/metafile.yml b/configs/levit/metafile.yml
new file mode 100644
index 0000000000000000000000000000000000000000..78b62c5c12dcee63d1790b597f0222d7f8324361
--- /dev/null
+++ b/configs/levit/metafile.yml
@@ -0,0 +1,101 @@
+Collections:
+ - Name: LeViT
+ Metadata:
+ Training Data: ImageNet-1k
+ Architecture:
+ - 1x1 Convolution
+ - LeViT Attention Block
+ Paper:
+ Title: "LeViT: a Vision Transformer in ConvNet\u2019s Clothing for Faster Inference"
+ URL: https://arxiv.org/abs/2104.01136
+ README: configs/levit/README.md
+ Code:
+ URL: https://github.com/open-mmlab/mmpretrain/blob/main/mmpretrain/models/backbones/levit.py
+ Version: v1.0.0rc5
+
+Models:
+ - Name: levit-128s_3rdparty_in1k
+ Metadata:
+ FLOPs: 310342496
+ Parameters: 7391290
+ Training Data: ImageNet-1k
+ In Collection: LeViT
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 76.51
+ Top 5 Accuracy: 92.90
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/levit/levit-128s_3rdparty_in1k_20230117-e9fbd209.pth
+ Config: configs/levit/levit-128s_8xb256_in1k.py
+ Converted From:
+ Weights: https://dl.fbaipublicfiles.com/LeViT/LeViT-128S-96703c44.pth
+ Code: https://github.com/facebookresearch/LeViT
+ - Name: levit-128_3rdparty_in1k
+ Metadata:
+ FLOPs: 413060992
+ Parameters: 8828168
+ Training Data: ImageNet-1k
+ In Collection: LeViT
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 78.58
+ Top 5 Accuracy: 93.95
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/levit/levit-128_3rdparty_in1k_20230117-3be02a02.pth
+ Config: configs/levit/levit-128_8xb256_in1k.py
+ Converted From:
+ Weights: https://dl.fbaipublicfiles.com/LeViT/LeViT-128-b88c2750.pth
+ Code: https://github.com/facebookresearch/LeViT
+ - Name: levit-192_3rdparty_in1k
+ Metadata:
+ FLOPs: 667860704
+ Parameters: 10561301
+ Training Data: ImageNet-1k
+ In Collection: LeViT
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 79.86
+ Top 5 Accuracy: 94.75
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/levit/levit-192_3rdparty_in1k_20230117-8217a0f9.pth
+ Config: configs/levit/levit-192_8xb256_in1k.py
+ Converted From:
+ Weights: https://dl.fbaipublicfiles.com/LeViT/LeViT-192-92712e41.pth
+ Code: https://github.com/facebookresearch/LeViT
+ - Name: levit-256_3rdparty_in1k
+ Metadata:
+ FLOPs: 1141625216
+ Parameters: 18379852
+ Training Data: ImageNet-1k
+ In Collection: LeViT
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 81.59
+ Top 5 Accuracy: 95.46
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/levit/levit-256_3rdparty_in1k_20230117-5ae2ce7d.pth
+ Config: configs/levit/levit-256_8xb256_in1k.py
+ Converted From:
+ Weights: https://dl.fbaipublicfiles.com/LeViT/LeViT-256-13b5763e.pth
+ Code: https://github.com/facebookresearch/LeViT
+ - Name: levit-384_3rdparty_in1k
+ Metadata:
+ FLOPs: 2372941568
+ Parameters: 38358300
+ Training Data: ImageNet-1k
+ In Collection: LeViT
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 82.59
+ Top 5 Accuracy: 95.95
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/levit/levit-384_3rdparty_in1k_20230117-f3539cce.pth
+ Config: configs/levit/levit-384_8xb256_in1k.py
+ Converted From:
+ Weights: https://dl.fbaipublicfiles.com/LeViT/LeViT-384-9bdaf2e2.pth
+ Code: https://github.com/facebookresearch/LeViT
diff --git a/configs/llava/README.md b/configs/llava/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..581abfe5a66c30ce9ff1062d2fe605e17bb2f501
--- /dev/null
+++ b/configs/llava/README.md
@@ -0,0 +1,51 @@
+# LLaVA
+
+> [Visual Instruction Tuning](https://arxiv.org/abs/2304.08485)
+
+
+
+## Abstract
+
+Instruction tuning large language models (LLMs) using machine-generated instruction-following data has improved zero-shot capabilities on new tasks, but the idea is less explored in the multimodal field. In this paper, we present the first attempt to use language-only GPT-4 to generate multimodal language-image instruction-following data. By instruction tuning on such generated data, we introduce LLaVA: Large Language and Vision Assistant, an end-to-end trained large multimodal model that connects a vision encoder and LLM for general-purpose visual and language understanding. Our early experiments show that LLaVA demonstrates impressive multimodal chat abilities, sometimes exhibiting the behaviors of multimodal GPT-4 on unseen images/instructions, and yields an 85.1% relative score compared with GPT-4 on a synthetic multimodal instruction-following dataset. When fine-tuned on Science QA, the synergy of LLaVA and GPT-4 achieves a new state-of-the-art accuracy of 92.53%. We make GPT-4 generated visual instruction tuning data, our model and code base publicly available.
+
+
+

+
+
+## How to use it?
+
+
+
+**Use the model**
+
+```python
+import torch
+from mmpretrain import get_model, inference_model
+
+out = inference_model('llava-7b-v1_caption', 'demo/cat-dog.png', device='cuda')
+print(out)
+# {'pred_caption': 'In the image, there are two cats sitting on a blanket.'}
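+
+# `get_model` can also build the model explicitly, e.g. for custom
+# inference loops (same model name as in the table below):
+model = get_model('llava-7b-v1_caption', pretrained=True, device='cuda')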
+```
+
+
+
+## Models and results
+
+### Image Caption on COCO
+
+| Model | Params (M) | Config | Download |
+| :---------------------- | :--------: | :--------------------------------: | :-------------------------------------------------------------------------------------------------------------: |
+| `llava-7b-v1_caption` | 7045.82 | [config](llava-7b-v1_caption.py) | [ckpt](https://download.openmmlab.com/mmclassification/v1/llava/llava-7b-v1_liuhaotian_20231025-c9e119b6.pth) |
+| `llava-7b-v1.5_caption` | 7062.90 | [config](llava-7b-v1.5_caption.py) | [ckpt](https://download.openmmlab.com/mmclassification/v1/llava/llava-7b-v1.5_liuhaotian_20231025-5828aa5a.pth) |
+| `llava-7b-v1.5_vqa` | 7062.90 | [config](llava-7b-v1.5_vqa.py) | [ckpt](https://download.openmmlab.com/mmclassification/v1/llava/llava-7b-v1.5_liuhaotian_20231025-5828aa5a.pth) |
+
+## Citation
+
+```bibtex
+@misc{liu2023llava,
+ title={Visual Instruction Tuning},
+ author={Liu, Haotian and Li, Chunyuan and Wu, Qingyang and Lee, Yong Jae},
+ publisher={arXiv:2304.08485},
+ year={2023},
+}
+```
diff --git a/configs/llava/llava-7b-v1.5_caption.py b/configs/llava/llava-7b-v1.5_caption.py
new file mode 100644
index 0000000000000000000000000000000000000000..371c9b5f6174416ade8708b9c74bc7f684f2af8c
--- /dev/null
+++ b/configs/llava/llava-7b-v1.5_caption.py
@@ -0,0 +1,76 @@
+_base_ = '../_base_/default_runtime.py'
+
+meta_prompt = "A chat between a curious human and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the human's questions." # noqa: E501
+image_size = 336
+prompt_tmpl = f'''{meta_prompt} User:
+Describe the image in detail. ASSISTANT:'''
+
+# model settings
+model = dict(
+ type='Llava',
+ tokenizer=dict(
+ type='AutoTokenizer', name_or_path='liuhaotian/llava-v1.5-7b'),
+ vision_encoder=dict(
+ type='VisionTransformer',
+ arch='l',
+ patch_size=14,
+ img_size=image_size,
+ pre_norm=True,
+ norm_cfg=dict(type='LN', eps=1e-5),
+ layer_cfgs=dict(act_cfg=dict(type='mmpretrain.QuickGELU')),
+ final_norm=False,
+ out_type='raw',
+ pretrained='https://download.openmmlab.com/mmclassification/v0/clip/'
+ 'vit-large-p14_clip-openai-pre_336px_20231025-fb1315ed.pth',
+ ),
+ mm_hidden_size=1024,
+ use_im_patch=False,
+ use_im_start_end=False,
+ mm_proj_depth=2,
+ lang_encoder=dict(
+ type='AutoModelForCausalLM',
+ name_or_path='huggyllama/llama-7b',
+ ),
+ task='caption',
+ prompt_tmpl=prompt_tmpl,
+ generation_cfg=dict(num_beams=3, max_new_tokens=50, length_penalty=-1.0),
+)
+
+# data settings
+data_preprocessor = dict(
+ type='MultiModalDataPreprocessor',
+ mean=[122.770938, 116.7460125, 104.09373615],
+ std=[68.5005327, 66.6321579, 70.32316305],
+ to_rgb=True,
+)
+
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='Resize',
+ scale=(image_size, image_size),
+ interpolation='bicubic',
+ backend='pillow'),
+ dict(type='PackInputs', meta_keys=['image_id']),
+]
+
+test_dataloader = dict(
+ batch_size=8,
+ num_workers=5,
+ dataset=dict(
+ type='COCOCaption',
+ data_root='data/coco',
+ ann_file='annotations/coco_karpathy_val.json',
+ pipeline=test_pipeline,
+ ),
+ sampler=dict(type='DefaultSampler', shuffle=False),
+ persistent_workers=True,
+)
+
+test_evaluator = dict(
+ type='COCOCaption',
+ ann_file='data/coco/annotations/coco_karpathy_val_gt.json',
+)
+
+# schedule settings
+test_cfg = dict()
diff --git a/configs/llava/llava-7b-v1.5_vqa.py b/configs/llava/llava-7b-v1.5_vqa.py
new file mode 100644
index 0000000000000000000000000000000000000000..5cb9812cd98b207c96b44da8261f4a11b4f04691
--- /dev/null
+++ b/configs/llava/llava-7b-v1.5_vqa.py
@@ -0,0 +1,76 @@
+_base_ = '../_base_/default_runtime.py'
+
+meta_prompt = "A chat between a curious human and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the human's questions." # noqa: E501
+image_size = 336
+prompt_tmpl = f'''{meta_prompt} User:
+{{question}} ASSISTANT:'''
+
+# model settings
+model = dict(
+ type='Llava',
+ tokenizer=dict(
+ type='AutoTokenizer', name_or_path='liuhaotian/llava-v1.5-7b'),
+ vision_encoder=dict(
+ type='VisionTransformer',
+ arch='l',
+ patch_size=14,
+ img_size=image_size,
+ pre_norm=True,
+ norm_cfg=dict(type='LN', eps=1e-5),
+ layer_cfgs=dict(act_cfg=dict(type='mmpretrain.QuickGELU')),
+ final_norm=False,
+ out_type='raw',
+ pretrained='https://download.openmmlab.com/mmclassification/v0/clip/'
+ 'vit-large-p14_clip-openai-pre_336px_20231025-fb1315ed.pth',
+ ),
+ mm_hidden_size=1024,
+ use_im_patch=False,
+ use_im_start_end=False,
+ mm_proj_depth=2,
+ lang_encoder=dict(
+ type='AutoModelForCausalLM',
+ name_or_path='huggyllama/llama-7b',
+ ),
+ task='vqa',
+ prompt_tmpl=prompt_tmpl,
+ generation_cfg=dict(max_new_tokens=100),
+)
+
+# data settings
+data_preprocessor = dict(
+ type='MultiModalDataPreprocessor',
+ mean=[122.770938, 116.7460125, 104.09373615],
+ std=[68.5005327, 66.6321579, 70.32316305],
+ to_rgb=True,
+)
+
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='Resize',
+ scale=(image_size, image_size),
+ interpolation='bicubic',
+ backend='pillow'),
+ dict(type='PackInputs', meta_keys=['image_id', 'question']),
+]
+
+test_dataloader = dict(
+ batch_size=8,
+ num_workers=5,
+ dataset=dict(
+ type='COCOCaption',
+ data_root='data/coco',
+ ann_file='annotations/coco_karpathy_val.json',
+ pipeline=test_pipeline,
+ ),
+ sampler=dict(type='DefaultSampler', shuffle=False),
+ persistent_workers=True,
+)
+
+test_evaluator = dict(
+ type='COCOCaption',
+ ann_file='data/coco/annotations/coco_karpathy_val_gt.json',
+)
+
+# schedule settings
+test_cfg = dict()
diff --git a/configs/llava/llava-7b-v1_caption.py b/configs/llava/llava-7b-v1_caption.py
new file mode 100644
index 0000000000000000000000000000000000000000..92e2d1fb65aab218a2c285c8d97b9f8886681304
--- /dev/null
+++ b/configs/llava/llava-7b-v1_caption.py
@@ -0,0 +1,78 @@
+_base_ = '../_base_/default_runtime.py'
+
+meta_prompt = 'You are LLaVA, a large language and vision assistant trained by UW Madison WAIV Lab. You are able to understand the visual content that the user provides, and assist the user with a variety of tasks using natural language. Follow the instructions carefully and explain your answers in detail.' # noqa: E501
+image_size = 224
+prompt_tmpl = f'''{meta_prompt} User:
+Describe the image in detail. ASSISTANT:'''
+
+# model settings
+model = dict(
+ type='Llava',
+ tokenizer=dict(
+ type='AutoTokenizer',
+ name_or_path='liuhaotian/LLaVA-Lightning-7B-delta-v1-1'),
+ vision_encoder=dict(
+ type='VisionTransformer',
+ arch='l',
+ patch_size=14,
+ img_size=image_size,
+ pre_norm=True,
+ norm_cfg=dict(type='LN', eps=1e-5),
+ layer_cfgs=dict(act_cfg=dict(type='mmpretrain.QuickGELU')),
+ final_norm=False,
+ out_type='raw',
+ pretrained=(
+ 'https://download.openmmlab.com/mmclassification/v0/clip/'
+ 'vit-large-p14_clip-openai-pre_3rdparty_20230517-95e2af0b.pth'),
+ ),
+ mm_hidden_size=1024,
+ use_im_patch=False,
+ use_im_start_end=True,
+ mm_proj_depth=1,
+ lang_encoder=dict(
+ type='AutoModelForCausalLM',
+ name_or_path='huggyllama/llama-7b',
+ ),
+ task='caption',
+ prompt_tmpl=prompt_tmpl,
+ generation_cfg=dict(max_new_tokens=50),
+)
+
+# data settings
+data_preprocessor = dict(
+ type='MultiModalDataPreprocessor',
+ mean=[122.770938, 116.7460125, 104.09373615],
+ std=[68.5005327, 66.6321579, 70.32316305],
+ to_rgb=True,
+)
+
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='Resize',
+ scale=(image_size, image_size),
+ interpolation='bicubic',
+ backend='pillow'),
+ dict(type='PackInputs', meta_keys=['image_id']),
+]
+
+test_dataloader = dict(
+ batch_size=8,
+ num_workers=5,
+ dataset=dict(
+ type='COCOCaption',
+ data_root='data/coco',
+ ann_file='annotations/coco_karpathy_val.json',
+ pipeline=test_pipeline,
+ ),
+ sampler=dict(type='DefaultSampler', shuffle=False),
+ persistent_workers=True,
+)
+
+test_evaluator = dict(
+ type='COCOCaption',
+ ann_file='data/coco/annotations/coco_karpathy_val_gt.json',
+)
+
+# schedule settings
+test_cfg = dict()
diff --git a/configs/llava/metafile.yml b/configs/llava/metafile.yml
new file mode 100644
index 0000000000000000000000000000000000000000..406a214c33a5d8a3d1e2b73cfebd51975a27071e
--- /dev/null
+++ b/configs/llava/metafile.yml
@@ -0,0 +1,51 @@
+Collections:
+ - Name: LLaVA
+ Metadata:
+ Architecture:
+ - LLaMA
+ - CLIP
+ Paper:
+ Title: Visual Instruction Tuning
+ URL: https://arxiv.org/abs/2304.08485
+ README: configs/llava/README.md
+
+Models:
+ - Name: llava-7b-v1_caption
+ Metadata:
+ FLOPs: null
+ Parameters: 7045816320
+ In Collection: LLaVA
+ Results:
+ - Task: Image Caption
+ Dataset: COCO
+ Metrics:
+ BLEU-4: null
+ CIDER: null
+ Weights: https://download.openmmlab.com/mmclassification/v1/llava/llava-7b-v1_liuhaotian_20231025-c9e119b6.pth
+ Config: configs/llava/llava-7b-v1_caption.py
+ - Name: llava-7b-v1.5_caption
+ Metadata:
+ FLOPs: null
+ Parameters: 7062900736
+ In Collection: LLaVA
+ Results:
+ - Task: Image Caption
+ Dataset: COCO
+ Metrics:
+ BLEU-4: null
+ CIDER: null
+ Weights: https://download.openmmlab.com/mmclassification/v1/llava/llava-7b-v1.5_liuhaotian_20231025-5828aa5a.pth
+ Config: configs/llava/llava-7b-v1.5_caption.py
+ - Name: llava-7b-v1.5_vqa
+ Metadata:
+ FLOPs: null
+ Parameters: 7062900736
+ In Collection: LLaVA
+ Results:
+ - Task: Visual Question Answering
+ Dataset: COCO
+ Metrics:
+ BLEU-4: null
+ CIDER: null
+ Weights: https://download.openmmlab.com/mmclassification/v1/llava/llava-7b-v1.5_liuhaotian_20231025-5828aa5a.pth
+ Config: configs/llava/llava-7b-v1.5_vqa.py
diff --git a/configs/mae/README.md b/configs/mae/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..69f5f9bf35f9aa4bbe3097c58256496445f864dd
--- /dev/null
+++ b/configs/mae/README.md
@@ -0,0 +1,123 @@
+# MAE
+
+> [Masked Autoencoders Are Scalable Vision Learners](https://arxiv.org/abs/2111.06377)
+
+
+
+## Abstract
+
+This paper shows that masked autoencoders (MAE) are
+scalable self-supervised learners for computer vision. Our
+MAE approach is simple: we mask random patches of the
+input image and reconstruct the missing pixels. It is based
+on two core designs. First, we develop an asymmetric
+encoder-decoder architecture, with an encoder that operates only on the
+visible subset of patches (without mask tokens), along with a lightweight
+decoder that reconstructs the original image from the latent representation
+and mask tokens. Second, we find that masking a high proportion
+of the input image, e.g., 75%, yields a nontrivial and
+meaningful self-supervisory task. Coupling these two designs enables us to
+train large models efficiently and effectively: we accelerate
+training (by 3× or more) and improve accuracy. Our scalable approach allows
+for learning high-capacity models that generalize well: e.g., a vanilla
+ViT-Huge model achieves the best accuracy (87.8%) among
+methods that use only ImageNet-1K data. Transfer performance in downstream tasks outperforms supervised pretraining and shows promising scaling behavior.
+
+
+
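+The masking described above is simple to sketch. The snippet below is a minimal, self-contained illustration of the 75% random patch masking (it is not the MMPreTrain implementation; the helper name and tensor shapes are assumptions for illustration):
+
+```python
+import torch
+
+def random_masking(tokens: torch.Tensor, mask_ratio: float = 0.75):
+    """Keep a random 25% of patch tokens; the rest are left for the decoder to reconstruct."""
+    B, N, C = tokens.shape
+    num_keep = int(N * (1 - mask_ratio))
+    ids_shuffle = torch.argsort(torch.rand(B, N), dim=1)  # random permutation per sample
+    ids_keep = ids_shuffle[:, :num_keep]                  # indices of visible patches
+    visible = torch.gather(tokens, 1, ids_keep.unsqueeze(-1).expand(-1, -1, C))
+    mask = torch.ones(B, N)
+    mask.scatter_(1, ids_keep, 0.)                        # 0 = visible, 1 = masked
+    return visible, mask
+
+patch_tokens = torch.rand(2, 196, 768)  # 14x14 patches of a 224px image, ViT-B width
+visible, mask = random_masking(patch_tokens)
+print(visible.shape)  # torch.Size([2, 49, 768]): the encoder only sees 25% of the patches
+```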
+
+
+
+## How to use it?
+
+
+
+**Predict image**
+
+```python
+from mmpretrain import inference_model
+
+predict = inference_model('vit-base-p16_mae-300e-pre_8xb128-coslr-100e_in1k', 'demo/bird.JPEG')
+print(predict['pred_class'])
+print(predict['pred_score'])
+```
+
+**Use the model**
+
+```python
+import torch
+from mmpretrain import get_model
+
+model = get_model('mae_vit-base-p16_8xb512-amp-coslr-300e_in1k', pretrained=True)
+inputs = torch.rand(1, 3, 224, 224)
+out = model(inputs)
+print(type(out))
+# To extract features.
+feats = model.extract_feat(inputs)
+print(type(feats))
+```
+
+**Train/Test Command**
+
+Prepare your dataset according to the [docs](https://mmpretrain.readthedocs.io/en/latest/user_guides/dataset_prepare.html#prepare-dataset).
+
+Train:
+
+```shell
+python tools/train.py configs/mae/mae_vit-base-p16_8xb512-amp-coslr-300e_in1k.py
+```
+
+Test:
+
+```shell
+python tools/test.py configs/mae/benchmarks/vit-base-p16_8xb128-coslr-100e_in1k.py None
+```
+
+
+
+## Models and results
+
+### Pretrained models
+
+| Model | Params (M) | Flops (G) | Config | Download |
+| :---------------------------------------------- | :--------: | :-------: | :--------------------------------------------------------: | :--------------------------------------------------------------------------: |
+| `mae_vit-base-p16_8xb512-amp-coslr-300e_in1k` | 111.91 | 17.58 | [config](mae_vit-base-p16_8xb512-amp-coslr-300e_in1k.py) | [model](https://download.openmmlab.com/mmselfsup/1.x/mae/mae_vit-base-p16_8xb512-fp16-coslr-300e_in1k/mae_vit-base-p16_8xb512-coslr-300e-fp16_in1k_20220829-c2cf66ba.pth) \| [log](https://download.openmmlab.com/mmselfsup/1.x/mae/mae_vit-base-p16_8xb512-fp16-coslr-300e_in1k/mae_vit-base-p16_8xb512-coslr-300e-fp16_in1k_20220829-c2cf66ba.json) |
+| `mae_vit-base-p16_8xb512-amp-coslr-400e_in1k` | 111.91 | 17.58 | [config](mae_vit-base-p16_8xb512-amp-coslr-400e_in1k.py) | [model](https://download.openmmlab.com/mmselfsup/1.x/mae/mae_vit-base-p16_8xb512-fp16-coslr-400e_in1k/mae_vit-base-p16_8xb512-coslr-400e-fp16_in1k_20220825-bc79e40b.pth) \| [log](https://download.openmmlab.com/mmselfsup/1.x/mae/mae_vit-base-p16_8xb512-fp16-coslr-400e_in1k/mae_vit-base-p16_8xb512-coslr-400e-fp16_in1k_20220825-bc79e40b.json) |
+| `mae_vit-base-p16_8xb512-amp-coslr-800e_in1k` | 111.91 | 17.58 | [config](mae_vit-base-p16_8xb512-amp-coslr-800e_in1k.py) | [model](https://download.openmmlab.com/mmselfsup/1.x/mae/mae_vit-base-p16_8xb512-fp16-coslr-800e_in1k/mae_vit-base-p16_8xb512-coslr-800e-fp16_in1k_20220825-5d81fbc4.pth) \| [log](https://download.openmmlab.com/mmselfsup/1.x/mae/mae_vit-base-p16_8xb512-fp16-coslr-800e_in1k/mae_vit-base-p16_8xb512-coslr-800e-fp16_in1k_20220825-5d81fbc4.json) |
+| `mae_vit-base-p16_8xb512-amp-coslr-1600e_in1k` | 111.91 | 17.58 | [config](mae_vit-base-p16_8xb512-amp-coslr-1600e_in1k.py) | [model](https://download.openmmlab.com/mmselfsup/1.x/mae/mae_vit-base-p16_8xb512-fp16-coslr-1600e_in1k/mae_vit-base-p16_8xb512-fp16-coslr-1600e_in1k_20220825-f7569ca2.pth) \| [log](https://download.openmmlab.com/mmselfsup/1.x/mae/mae_vit-base-p16_8xb512-fp16-coslr-1600e_in1k/mae_vit-base-p16_8xb512-fp16-coslr-1600e_in1k_20220825-f7569ca2.json) |
+| `mae_vit-large-p16_8xb512-amp-coslr-400e_in1k` | 329.54 | 61.60 | [config](mae_vit-large-p16_8xb512-amp-coslr-400e_in1k.py) | [model](https://download.openmmlab.com/mmselfsup/1.x/mae/mae_vit-large-p16_8xb512-fp16-coslr-400e_in1k/mae_vit-large-p16_8xb512-fp16-coslr-400e_in1k_20220825-b11d0425.pth) \| [log](https://download.openmmlab.com/mmselfsup/1.x/mae/mae_vit-large-p16_8xb512-fp16-coslr-400e_in1k/mae_vit-large-p16_8xb512-fp16-coslr-400e_in1k_20220825-b11d0425.json) |
+| `mae_vit-large-p16_8xb512-amp-coslr-800e_in1k` | 329.54 | 61.60 | [config](mae_vit-large-p16_8xb512-amp-coslr-800e_in1k.py) | [model](https://download.openmmlab.com/mmselfsup/1.x/mae/mae_vit-large-p16_8xb512-fp16-coslr-800e_in1k/mae_vit-large-p16_8xb512-fp16-coslr-800e_in1k_20220825-df72726a.pth) \| [log](https://download.openmmlab.com/mmselfsup/1.x/mae/mae_vit-large-p16_8xb512-fp16-coslr-800e_in1k/mae_vit-large-p16_8xb512-fp16-coslr-800e_in1k_20220825-df72726a.json) |
+| `mae_vit-large-p16_8xb512-amp-coslr-1600e_in1k` | 329.54 | 61.60 | [config](mae_vit-large-p16_8xb512-amp-coslr-1600e_in1k.py) | [model](https://download.openmmlab.com/mmselfsup/1.x/mae/mae_vit-large-p16_8xb512-fp16-coslr-1600e_in1k/mae_vit-large-p16_8xb512-fp16-coslr-1600e_in1k_20220825-cc7e98c9.pth) \| [log](https://download.openmmlab.com/mmselfsup/1.x/mae/mae_vit-large-p16_8xb512-fp16-coslr-1600e_in1k/mae_vit-large-p16_8xb512-fp16-coslr-1600e_in1k_20220825-cc7e98c9.json) |
+| `mae_vit-huge-p14_8xb512-amp-coslr-1600e_in1k` | 657.07 | 167.40 | [config](mae_vit-huge-p14_8xb512-amp-coslr-1600e_in1k.py) | [model](https://download.openmmlab.com/mmselfsup/1.x/mae/mae_vit-huge-p16_8xb512-fp16-coslr-1600e_in1k/mae_vit-huge-p16_8xb512-fp16-coslr-1600e_in1k_20220916-ff848775.pth) \| [log](https://download.openmmlab.com/mmselfsup/1.x/mae/mae_vit-huge-p16_8xb512-fp16-coslr-1600e_in1k/mae_vit-huge-p16_8xb512-fp16-coslr-1600e_in1k_20220916-ff848775.json) |
+
+### Image Classification on ImageNet-1k
+
+| Model | Pretrain | Params (M) | Flops (G) | Top-1 (%) | Config | Download |
+| :---------------------------------------- | :------------------------------------------: | :--------: | :-------: | :-------: | :----------------------------------------: | :-------------------------------------------: |
+| `vit-base-p16_mae-300e-pre_8xb128-coslr-100e_in1k` | [MAE 300-Epochs](https://download.openmmlab.com/mmselfsup/1.x/mae/mae_vit-base-p16_8xb512-fp16-coslr-300e_in1k/mae_vit-base-p16_8xb512-coslr-300e-fp16_in1k_20220829-c2cf66ba.pth) | 86.57 | 17.58 | 83.10 | [config](benchmarks/vit-base-p16_8xb128-coslr-100e_in1k.py) | N/A |
+| `vit-base-p16_mae-400e-pre_8xb128-coslr-100e_in1k` | [MAE 400-Epochs](https://download.openmmlab.com/mmselfsup/1.x/mae/mae_vit-base-p16_8xb512-fp16-coslr-400e_in1k/mae_vit-base-p16_8xb512-coslr-400e-fp16_in1k_20220825-bc79e40b.pth) | 86.57 | 17.58 | 83.30 | [config](benchmarks/vit-base-p16_8xb128-coslr-100e_in1k.py) | N/A |
+| `vit-base-p16_mae-800e-pre_8xb128-coslr-100e_in1k` | [MAE 800-Epochs](https://download.openmmlab.com/mmselfsup/1.x/mae/mae_vit-base-p16_8xb512-fp16-coslr-800e_in1k/mae_vit-base-p16_8xb512-coslr-800e-fp16_in1k_20220825-5d81fbc4.pth) | 86.57 | 17.58 | 83.30 | [config](benchmarks/vit-base-p16_8xb128-coslr-100e_in1k.py) | N/A |
+| `vit-base-p16_mae-1600e-pre_8xb128-coslr-100e_in1k` | [MAE 1600-Epochs](https://download.openmmlab.com/mmselfsup/1.x/mae/mae_vit-base-p16_8xb512-fp16-coslr-1600e_in1k/mae_vit-base-p16_8xb512-fp16-coslr-1600e_in1k_20220825-f7569ca2.pth) | 86.57 | 17.58 | 83.50 | [config](benchmarks/vit-base-p16_8xb128-coslr-100e_in1k.py) | [model](https://download.openmmlab.com/mmselfsup/1.x/mae/mae_vit-base-p16_8xb512-fp16-coslr-1600e_in1k/vit-base-p16_ft-8xb128-coslr-100e_in1k/vit-base-p16_ft-8xb128-coslr-100e_in1k_20220825-cf70aa21.pth) \| [log](https://download.openmmlab.com/mmselfsup/1.x/mae/mae_vit-base-p16_8xb512-fp16-coslr-1600e_in1k/vit-base-p16_ft-8xb128-coslr-100e_in1k/vit-base-p16_ft-8xb128-coslr-100e_in1k_20220825-cf70aa21.json) |
+| `vit-base-p16_mae-300e-pre_8xb2048-linear-coslr-90e_in1k` | [MAE 300-Epochs](https://download.openmmlab.com/mmselfsup/1.x/mae/mae_vit-base-p16_8xb512-fp16-coslr-300e_in1k/mae_vit-base-p16_8xb512-coslr-300e-fp16_in1k_20220829-c2cf66ba.pth) | 86.57 | 17.58 | 60.80 | [config](benchmarks/vit-base-p16_8xb2048-linear-coslr-90e_in1k.py) | N/A |
+| `vit-base-p16_mae-400e-pre_8xb2048-linear-coslr-90e_in1k` | [MAE 400-Epochs](https://download.openmmlab.com/mmselfsup/1.x/mae/mae_vit-base-p16_8xb512-fp16-coslr-400e_in1k/mae_vit-base-p16_8xb512-coslr-400e-fp16_in1k_20220825-bc79e40b.pth) | 86.57 | 17.58 | 62.50 | [config](benchmarks/vit-base-p16_8xb2048-linear-coslr-90e_in1k.py) | N/A |
+| `vit-base-p16_mae-800e-pre_8xb2048-linear-coslr-90e_in1k` | [MAE 800-Epochs](https://download.openmmlab.com/mmselfsup/1.x/mae/mae_vit-base-p16_8xb512-fp16-coslr-800e_in1k/mae_vit-base-p16_8xb512-coslr-800e-fp16_in1k_20220825-5d81fbc4.pth) | 86.57 | 17.58 | 65.10 | [config](benchmarks/vit-base-p16_8xb2048-linear-coslr-90e_in1k.py) | N/A |
+| `vit-base-p16_mae-1600e-pre_8xb2048-linear-coslr-90e_in1k` | [MAE 1600-Epochs](https://download.openmmlab.com/mmselfsup/1.x/mae/mae_vit-base-p16_8xb512-fp16-coslr-1600e_in1k/mae_vit-base-p16_8xb512-fp16-coslr-1600e_in1k_20220825-f7569ca2.pth) | 86.57 | 17.58 | 67.10 | [config](benchmarks/vit-base-p16_8xb2048-linear-coslr-90e_in1k.py) | N/A |
+| `vit-large-p16_mae-400e-pre_8xb128-coslr-50e_in1k` | [MAE 400-Epochs](https://download.openmmlab.com/mmselfsup/1.x/mae/mae_vit-large-p16_8xb512-fp16-coslr-400e_in1k/mae_vit-large-p16_8xb512-fp16-coslr-400e_in1k_20220825-b11d0425.pth) | 304.32 | 61.60 | 85.20 | [config](benchmarks/vit-large-p16_8xb128-coslr-50e_in1k.py) | N/A |
+| `vit-large-p16_mae-800e-pre_8xb128-coslr-50e_in1k` | [MAE 800-Epochs](https://download.openmmlab.com/mmselfsup/1.x/mae/mae_vit-large-p16_8xb512-fp16-coslr-800e_in1k/mae_vit-large-p16_8xb512-fp16-coslr-800e_in1k_20220825-df72726a.pth) | 304.32 | 61.60 | 85.40 | [config](benchmarks/vit-large-p16_8xb128-coslr-50e_in1k.py) | N/A |
+| `vit-large-p16_mae-1600e-pre_8xb128-coslr-50e_in1k` | [MAE 1600-Epochs](https://download.openmmlab.com/mmselfsup/1.x/mae/mae_vit-large-p16_8xb512-fp16-coslr-1600e_in1k/mae_vit-large-p16_8xb512-fp16-coslr-1600e_in1k_20220825-cc7e98c9.pth) | 304.32 | 61.60 | 85.70 | [config](benchmarks/vit-large-p16_8xb128-coslr-50e_in1k.py) | N/A |
+| `vit-large-p16_mae-400e-pre_8xb2048-linear-coslr-90e_in1k` | [MAE 400-Epochs](https://download.openmmlab.com/mmselfsup/1.x/mae/mae_vit-large-p16_8xb512-fp16-coslr-400e_in1k/mae_vit-large-p16_8xb512-fp16-coslr-400e_in1k_20220825-b11d0425.pth) | 304.33 | 61.60 | 70.70 | [config](benchmarks/vit-large-p16_8xb2048-linear-coslr-90e_in1k.py) | N/A |
+| `vit-large-p16_mae-800e-pre_8xb2048-linear-coslr-90e_in1k` | [MAE 800-Epochs](https://download.openmmlab.com/mmselfsup/1.x/mae/mae_vit-large-p16_8xb512-fp16-coslr-800e_in1k/mae_vit-large-p16_8xb512-fp16-coslr-800e_in1k_20220825-df72726a.pth) | 304.33 | 61.60 | 73.70 | [config](benchmarks/vit-large-p16_8xb2048-linear-coslr-90e_in1k.py) | N/A |
+| `vit-large-p16_mae-1600e-pre_8xb2048-linear-coslr-90e_in1k` | [MAE 1600-Epochs](https://download.openmmlab.com/mmselfsup/1.x/mae/mae_vit-large-p16_8xb512-fp16-coslr-1600e_in1k/mae_vit-large-p16_8xb512-fp16-coslr-1600e_in1k_20220825-cc7e98c9.pth) | 304.33 | 61.60 | 75.50 | [config](benchmarks/vit-large-p16_8xb2048-linear-coslr-90e_in1k.py) | N/A |
+| `vit-huge-p14_mae-1600e-pre_8xb128-coslr-50e_in1k` | [MAE 1600-Epochs](https://download.openmmlab.com/mmselfsup/1.x/mae/mae_vit-huge-p16_8xb512-fp16-coslr-1600e_in1k/mae_vit-huge-p16_8xb512-fp16-coslr-1600e_in1k_20220916-ff848775.pth) | 632.04 | 167.40 | 86.90 | [config](benchmarks/vit-huge-p14_8xb128-coslr-50e_in1k.py) | [model](https://download.openmmlab.com/mmselfsup/1.x/mae/mae_vit-huge-p16_8xb512-fp16-coslr-1600e_in1k/vit-huge-p16_ft-8xb128-coslr-50e_in1k/vit-huge-p16_ft-8xb128-coslr-50e_in1k_20220916-0bfc9bfd.pth) \| [log](https://download.openmmlab.com/mmselfsup/1.x/mae/mae_vit-huge-p16_8xb512-fp16-coslr-1600e_in1k/vit-huge-p16_ft-8xb128-coslr-50e_in1k/vit-huge-p16_ft-8xb128-coslr-50e_in1k_20220916-0bfc9bfd.json) |
+| `vit-huge-p14_mae-1600e-pre_32xb8-coslr-50e_in1k-448px` | [MAE 1600-Epochs](https://download.openmmlab.com/mmselfsup/1.x/mae/mae_vit-huge-p16_8xb512-fp16-coslr-1600e_in1k/mae_vit-huge-p16_8xb512-fp16-coslr-1600e_in1k_20220916-ff848775.pth) | 633.03 | 732.13 | 87.30 | [config](benchmarks/vit-huge-p14_32xb8-coslr-50e_in1k-448px.py) | [model](https://download.openmmlab.com/mmselfsup/1.x/mae/mae_vit-huge-p16_8xb512-fp16-coslr-1600e_in1k/vit-huge-p16_ft-32xb8-coslr-50e_in1k-448/vit-huge-p16_ft-32xb8-coslr-50e_in1k-448_20220916-95b6a0ce.pth) \| [log](https://download.openmmlab.com/mmselfsup/1.x/mae/mae_vit-huge-p16_8xb512-fp16-coslr-1600e_in1k/vit-huge-p16_ft-32xb8-coslr-50e_in1k-448/vit-huge-p16_ft-32xb8-coslr-50e_in1k-448_20220916-95b6a0ce.json) |
+
+## Citation
+
+```bibtex
+@article{He2021MaskedAA,
+ title={Masked Autoencoders Are Scalable Vision Learners},
+ author={Kaiming He and Xinlei Chen and Saining Xie and Yanghao Li and
+ Piotr Dollár and Ross B. Girshick},
+ journal={arXiv},
+ year={2021}
+}
+```
diff --git a/configs/mae/benchmarks/vit-base-p16_8xb128-coslr-100e_in1k.py b/configs/mae/benchmarks/vit-base-p16_8xb128-coslr-100e_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..4cf9ca1134766cd3b0179b7581511cd94dedbbc2
--- /dev/null
+++ b/configs/mae/benchmarks/vit-base-p16_8xb128-coslr-100e_in1k.py
@@ -0,0 +1,114 @@
+_base_ = [
+ '../../_base_/datasets/imagenet_bs64_swin_224.py',
+ '../../_base_/schedules/imagenet_bs1024_adamw_swin.py',
+ '../../_base_/default_runtime.py'
+]
+
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='RandomResizedCrop',
+ scale=224,
+ backend='pillow',
+ interpolation='bicubic'),
+ dict(type='RandomFlip', prob=0.5, direction='horizontal'),
+ dict(
+ type='RandAugment',
+ policies='timm_increasing',
+ num_policies=2,
+ total_level=10,
+ magnitude_level=9,
+ magnitude_std=0.5,
+ hparams=dict(pad_val=[104, 116, 124], interpolation='bicubic')),
+ dict(
+ type='RandomErasing',
+ erase_prob=0.25,
+ mode='rand',
+ min_area_ratio=0.02,
+ max_area_ratio=0.3333333333333333,
+ fill_color=[103.53, 116.28, 123.675],
+ fill_std=[57.375, 57.12, 58.395]),
+ dict(type='PackInputs')
+]
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='ResizeEdge',
+ scale=256,
+ edge='short',
+ backend='pillow',
+ interpolation='bicubic'),
+ dict(type='CenterCrop', crop_size=224),
+ dict(type='PackInputs')
+]
+
+train_dataloader = dict(batch_size=128, dataset=dict(pipeline=train_pipeline))
+val_dataloader = dict(batch_size=128, dataset=dict(pipeline=test_pipeline))
+test_dataloader = val_dataloader
+
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='VisionTransformer',
+ arch='base',
+ img_size=224,
+ patch_size=16,
+ drop_path_rate=0.1,
+ out_type='avg_featmap',
+ final_norm=False,
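+ # NOTE: the checkpoint path is intentionally left empty; pass the MAE pre-trained
+ # weights at launch time, e.g. --cfg-options model.backbone.init_cfg.checkpoint=<path or URL>.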
+ init_cfg=dict(type='Pretrained', checkpoint='', prefix='backbone.')),
+ neck=None,
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=768,
+ loss=dict(
+ type='LabelSmoothLoss', label_smooth_val=0.1, mode='original'),
+ init_cfg=[dict(type='TruncNormal', layer='Linear', std=2e-5)]),
+ train_cfg=dict(augments=[
+ dict(type='Mixup', alpha=0.8),
+ dict(type='CutMix', alpha=1.0)
+ ]))
+
+# optimizer wrapper
+optim_wrapper = dict(
+ optimizer=dict(
+ type='AdamW', lr=2e-3, weight_decay=0.05, betas=(0.9, 0.999)),
+ constructor='LearningRateDecayOptimWrapperConstructor',
+ paramwise_cfg=dict(
+ layer_decay_rate=0.65,
+ custom_keys={
+ '.ln': dict(decay_mult=0.0),
+ '.bias': dict(decay_mult=0.0),
+ '.cls_token': dict(decay_mult=0.0),
+ '.pos_embed': dict(decay_mult=0.0)
+ }))
+
+# learning rate scheduler
+param_scheduler = [
+ dict(
+ type='LinearLR',
+ start_factor=1e-4,
+ by_epoch=True,
+ begin=0,
+ end=5,
+ convert_to_iter_based=True),
+ dict(
+ type='CosineAnnealingLR',
+ T_max=95,
+ by_epoch=True,
+ begin=5,
+ end=100,
+ eta_min=1e-6,
+ convert_to_iter_based=True)
+]
+
+# runtime settings
+default_hooks = dict(
+ # save checkpoint per epoch.
+ checkpoint=dict(type='CheckpointHook', interval=1, max_keep_ckpts=3))
+
+train_cfg = dict(by_epoch=True, max_epochs=100)
+
+randomness = dict(seed=0, diff_rank_seed=True)
diff --git a/configs/mae/benchmarks/vit-base-p16_8xb2048-linear-coslr-90e_in1k.py b/configs/mae/benchmarks/vit-base-p16_8xb2048-linear-coslr-90e_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..b0545c99d002925886349c7979ab0722fbf8f37a
--- /dev/null
+++ b/configs/mae/benchmarks/vit-base-p16_8xb2048-linear-coslr-90e_in1k.py
@@ -0,0 +1,64 @@
+_base_ = [
+ '../../_base_/datasets/imagenet_bs32_pil_resize.py',
+ '../../_base_/schedules/imagenet_bs1024_adamw_swin.py',
+ '../../_base_/default_runtime.py'
+]
+
+# dataset settings
+train_dataloader = dict(batch_size=2048, drop_last=True)
+val_dataloader = dict(drop_last=False)
+test_dataloader = dict(drop_last=False)
+
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='VisionTransformer',
+ arch='base',
+ img_size=224,
+ patch_size=16,
+ frozen_stages=12,
+ out_type='cls_token',
+ final_norm=True,
+ init_cfg=dict(type='Pretrained', checkpoint='', prefix='backbone.')),
+ neck=dict(type='ClsBatchNormNeck', input_features=768),
+ head=dict(
+ type='VisionTransformerClsHead',
+ num_classes=1000,
+ in_channels=768,
+ loss=dict(type='CrossEntropyLoss'),
+ init_cfg=[dict(type='TruncNormal', layer='Linear', std=0.01)]))
+
+# optimizer
+optim_wrapper = dict(
+ _delete_=True,
+ type='AmpOptimWrapper',
+ optimizer=dict(type='LARS', lr=6.4, weight_decay=0.0, momentum=0.9))
+
+# learning rate scheduler
+param_scheduler = [
+ dict(
+ type='LinearLR',
+ start_factor=1e-4,
+ by_epoch=True,
+ begin=0,
+ end=10,
+ convert_to_iter_based=True),
+ dict(
+ type='CosineAnnealingLR',
+ T_max=80,
+ by_epoch=True,
+ begin=10,
+ end=90,
+ eta_min=0.0,
+ convert_to_iter_based=True)
+]
+
+# runtime settings
+train_cfg = dict(by_epoch=True, max_epochs=90)
+
+default_hooks = dict(
+ checkpoint=dict(type='CheckpointHook', interval=1, max_keep_ckpts=3),
+ logger=dict(type='LoggerHook', interval=10))
+
+randomness = dict(seed=0, diff_rank_seed=True)
diff --git a/configs/mae/benchmarks/vit-huge-p14_32xb8-coslr-50e_in1k-448px.py b/configs/mae/benchmarks/vit-huge-p14_32xb8-coslr-50e_in1k-448px.py
new file mode 100644
index 0000000000000000000000000000000000000000..60046b48d49f2bcc74a672c7b615da3062ad829b
--- /dev/null
+++ b/configs/mae/benchmarks/vit-huge-p14_32xb8-coslr-50e_in1k-448px.py
@@ -0,0 +1,116 @@
+_base_ = [
+ '../../_base_/datasets/imagenet_bs64_swin_224.py',
+ '../../_base_/schedules/imagenet_bs1024_adamw_swin.py',
+ '../../_base_/default_runtime.py'
+]
+
+# dataset settings
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='RandomResizedCrop',
+ scale=448,
+ backend='pillow',
+ interpolation='bicubic'),
+ dict(type='RandomFlip', prob=0.5, direction='horizontal'),
+ dict(
+ type='RandAugment',
+ policies='timm_increasing',
+ num_policies=2,
+ total_level=10,
+ magnitude_level=9,
+ magnitude_std=0.5,
+ hparams=dict(pad_val=[104, 116, 124], interpolation='bicubic')),
+ dict(
+ type='RandomErasing',
+ erase_prob=0.25,
+ mode='rand',
+ min_area_ratio=0.02,
+ max_area_ratio=0.3333333333333333,
+ fill_color=[103.53, 116.28, 123.675],
+ fill_std=[57.375, 57.12, 58.395]),
+ dict(type='PackInputs')
+]
+
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='ResizeEdge',
+ scale=512,
+ edge='short',
+ backend='pillow',
+ interpolation='bicubic'),
+ dict(type='CenterCrop', crop_size=448),
+ dict(type='PackInputs')
+]
+
+train_dataloader = dict(batch_size=128, dataset=dict(pipeline=train_pipeline))
+val_dataloader = dict(batch_size=128, dataset=dict(pipeline=test_pipeline))
+test_dataloader = val_dataloader
+
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='VisionTransformer',
+ arch='huge',
+ img_size=448,
+ patch_size=14,
+ drop_path_rate=0.3, # set to 0.3
+ out_type='avg_featmap',
+ final_norm=False,
+ init_cfg=dict(type='Pretrained', checkpoint='', prefix='backbone.')),
+ neck=None,
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=1280,
+ loss=dict(
+ type='LabelSmoothLoss', label_smooth_val=0.1, mode='original'),
+ init_cfg=[dict(type='TruncNormal', layer='Linear', std=2e-5)]),
+ train_cfg=dict(augments=[
+ dict(type='Mixup', alpha=0.8),
+ dict(type='CutMix', alpha=1.0)
+ ]))
+
+# optimizer wrapper
+# learning rate and layer decay rate are set to 0.004 and 0.75 respectively
+optim_wrapper = dict(
+ optimizer=dict(
+ type='AdamW', lr=4e-3, weight_decay=0.05, betas=(0.9, 0.999)),
+ constructor='LearningRateDecayOptimWrapperConstructor',
+ paramwise_cfg=dict(
+ layer_decay_rate=0.75,
+ custom_keys={
+ '.ln': dict(decay_mult=0.0),
+ '.bias': dict(decay_mult=0.0),
+ '.cls_token': dict(decay_mult=0.0),
+ '.pos_embed': dict(decay_mult=0.0)
+ }))
+
+# learning rate scheduler
+param_scheduler = [
+ dict(
+ type='LinearLR',
+ start_factor=1e-4,
+ by_epoch=True,
+ begin=0,
+ end=5,
+ convert_to_iter_based=True),
+ dict(
+ type='CosineAnnealingLR',
+ T_max=45,
+ by_epoch=True,
+ begin=5,
+ end=50,
+ eta_min=1e-6,
+ convert_to_iter_based=True)
+]
+
+# runtime settings
+train_cfg = dict(by_epoch=True, max_epochs=50)
+default_hooks = dict(
+ # save checkpoint per epoch.
+ checkpoint=dict(type='CheckpointHook', interval=1, max_keep_ckpts=3))
+
+randomness = dict(seed=0, diff_rank_seed=True)
diff --git a/configs/mae/benchmarks/vit-huge-p14_8xb128-coslr-50e_in1k.py b/configs/mae/benchmarks/vit-huge-p14_8xb128-coslr-50e_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..2a9ff51890be80c6070058b2dd3e837027864da5
--- /dev/null
+++ b/configs/mae/benchmarks/vit-huge-p14_8xb128-coslr-50e_in1k.py
@@ -0,0 +1,115 @@
+_base_ = [
+ '../../_base_/datasets/imagenet_bs64_swin_224.py',
+ '../../_base_/schedules/imagenet_bs1024_adamw_swin.py',
+ '../../_base_/default_runtime.py'
+]
+
+# dataset settings
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='RandomResizedCrop',
+ scale=224,
+ backend='pillow',
+ interpolation='bicubic'),
+ dict(type='RandomFlip', prob=0.5, direction='horizontal'),
+ dict(
+ type='RandAugment',
+ policies='timm_increasing',
+ num_policies=2,
+ total_level=10,
+ magnitude_level=9,
+ magnitude_std=0.5,
+ hparams=dict(pad_val=[104, 116, 124], interpolation='bicubic')),
+ dict(
+ type='RandomErasing',
+ erase_prob=0.25,
+ mode='rand',
+ min_area_ratio=0.02,
+ max_area_ratio=0.3333333333333333,
+ fill_color=[103.53, 116.28, 123.675],
+ fill_std=[57.375, 57.12, 58.395]),
+ dict(type='PackInputs')
+]
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='ResizeEdge',
+ scale=256,
+ edge='short',
+ backend='pillow',
+ interpolation='bicubic'),
+ dict(type='CenterCrop', crop_size=224),
+ dict(type='PackInputs')
+]
+
+train_dataloader = dict(batch_size=128, dataset=dict(pipeline=train_pipeline))
+val_dataloader = dict(batch_size=128, dataset=dict(pipeline=test_pipeline))
+test_dataloader = val_dataloader
+
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='VisionTransformer',
+ arch='huge',
+ img_size=224,
+ patch_size=14,
+ drop_path_rate=0.3, # set to 0.3
+ out_type='avg_featmap',
+ final_norm=False,
+ init_cfg=dict(type='Pretrained', checkpoint='', prefix='backbone.')),
+ neck=None,
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=1280,
+ loss=dict(
+ type='LabelSmoothLoss', label_smooth_val=0.1, mode='original'),
+ init_cfg=[dict(type='TruncNormal', layer='Linear', std=2e-5)]),
+ train_cfg=dict(augments=[
+ dict(type='Mixup', alpha=0.8),
+ dict(type='CutMix', alpha=1.0)
+ ]))
+
+# optimizer wrapper
+# learning rate and layer decay rate are set to 0.004 and 0.75 respectively
+optim_wrapper = dict(
+ optimizer=dict(
+ type='AdamW', lr=4e-3, weight_decay=0.05, betas=(0.9, 0.999)),
+ constructor='LearningRateDecayOptimWrapperConstructor',
+ paramwise_cfg=dict(
+ layer_decay_rate=0.75,
+ custom_keys={
+ '.ln': dict(decay_mult=0.0),
+ '.bias': dict(decay_mult=0.0),
+ '.cls_token': dict(decay_mult=0.0),
+ '.pos_embed': dict(decay_mult=0.0)
+ }))
+
+# learning rate scheduler
+param_scheduler = [
+ dict(
+ type='LinearLR',
+ start_factor=1e-4,
+ by_epoch=True,
+ begin=0,
+ end=5,
+ convert_to_iter_based=True),
+ dict(
+ type='CosineAnnealingLR',
+ T_max=45,
+ by_epoch=True,
+ begin=5,
+ end=50,
+ eta_min=1e-6,
+ convert_to_iter_based=True)
+]
+
+# runtime settings
+train_cfg = dict(by_epoch=True, max_epochs=50)
+default_hooks = dict(
+ # save checkpoint per epoch.
+ checkpoint=dict(type='CheckpointHook', interval=1, max_keep_ckpts=3))
+
+randomness = dict(seed=0, diff_rank_seed=True)
diff --git a/configs/mae/benchmarks/vit-huge-p14_8xb128-ds-coslr-50e_in1k.py b/configs/mae/benchmarks/vit-huge-p14_8xb128-ds-coslr-50e_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..813f7c03f300e1579b2ca036995b1a78135f2293
--- /dev/null
+++ b/configs/mae/benchmarks/vit-huge-p14_8xb128-ds-coslr-50e_in1k.py
@@ -0,0 +1,31 @@
+_base_ = ['./vit-huge-p14_8xb128-coslr-50e_in1k.py']
+
+# optimizer wrapper
+optim_wrapper = dict(type='DeepSpeedOptimWrapper')
+
+# training strategy
+strategy = dict(
+ type='DeepSpeedStrategy',
+ fp16=dict(
+ enabled=True,
+ fp16_master_weights_and_grads=False,
+ loss_scale=0,
+ loss_scale_window=500,
+ hysteresis=2,
+ min_loss_scale=1,
+ initial_scale_power=15,
+ ),
+ inputs_to_half=['inputs'],
+ zero_optimization=dict(
+ stage=1,
+ allgather_partitions=True,
+ reduce_scatter=True,
+ allgather_bucket_size=50000000,
+ reduce_bucket_size=50000000,
+ overlap_comm=True,
+ contiguous_gradients=True,
+ cpu_offload=False,
+ ))
+
+# runner which supports strategies
+runner_type = 'FlexibleRunner'
diff --git a/configs/mae/benchmarks/vit-huge-p14_8xb128-fsdp-coslr-50e_in1k.py b/configs/mae/benchmarks/vit-huge-p14_8xb128-fsdp-coslr-50e_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..5f8dfb760f3e0282a5efce7bd9322ca381a802c2
--- /dev/null
+++ b/configs/mae/benchmarks/vit-huge-p14_8xb128-fsdp-coslr-50e_in1k.py
@@ -0,0 +1,13 @@
+_base_ = ['./vit-huge-p14_8xb128-coslr-50e_in1k.py']
+
+strategy = dict(
+ type='FSDPStrategy',
+ model_wrapper=dict(
+ auto_wrap_policy=dict(
+ type='torch.distributed.fsdp.wrap.size_based_auto_wrap_policy',
+ min_num_params=1e7)))
+
+optim_wrapper = dict(type='AmpOptimWrapper')
+
+# runner which supports strategies
+runner_type = 'FlexibleRunner'
diff --git a/configs/mae/benchmarks/vit-large-p16_8xb128-coslr-50e_in1k.py b/configs/mae/benchmarks/vit-large-p16_8xb128-coslr-50e_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..ae86b40b8a262bc9f33e523afd161fdb014971bd
--- /dev/null
+++ b/configs/mae/benchmarks/vit-large-p16_8xb128-coslr-50e_in1k.py
@@ -0,0 +1,115 @@
+_base_ = [
+ '../../_base_/datasets/imagenet_bs64_swin_224.py',
+ '../../_base_/schedules/imagenet_bs1024_adamw_swin.py',
+ '../../_base_/default_runtime.py'
+]
+
+# dataset settings
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='RandomResizedCrop',
+ scale=224,
+ backend='pillow',
+ interpolation='bicubic'),
+ dict(type='RandomFlip', prob=0.5, direction='horizontal'),
+ dict(
+ type='RandAugment',
+ policies='timm_increasing',
+ num_policies=2,
+ total_level=10,
+ magnitude_level=9,
+ magnitude_std=0.5,
+ hparams=dict(pad_val=[104, 116, 124], interpolation='bicubic')),
+ dict(
+ type='RandomErasing',
+ erase_prob=0.25,
+ mode='rand',
+ min_area_ratio=0.02,
+ max_area_ratio=0.3333333333333333,
+ fill_color=[103.53, 116.28, 123.675],
+ fill_std=[57.375, 57.12, 58.395]),
+ dict(type='PackInputs')
+]
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='ResizeEdge',
+ scale=256,
+ edge='short',
+ backend='pillow',
+ interpolation='bicubic'),
+ dict(type='CenterCrop', crop_size=224),
+ dict(type='PackInputs')
+]
+
+train_dataloader = dict(batch_size=128, dataset=dict(pipeline=train_pipeline))
+val_dataloader = dict(batch_size=128, dataset=dict(pipeline=test_pipeline))
+test_dataloader = val_dataloader
+
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='VisionTransformer',
+ arch='large',
+ img_size=224,
+ patch_size=16,
+ drop_path_rate=0.2, # set to 0.2
+ out_type='avg_featmap',
+ final_norm=False,
+ init_cfg=dict(type='Pretrained', checkpoint='', prefix='backbone.')),
+ neck=None,
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=1024,
+ loss=dict(
+ type='LabelSmoothLoss', label_smooth_val=0.1, mode='original'),
+ init_cfg=[dict(type='TruncNormal', layer='Linear', std=2e-5)]),
+ train_cfg=dict(augments=[
+ dict(type='Mixup', alpha=0.8),
+ dict(type='CutMix', alpha=1.0)
+ ]))
+
+# optimizer wrapper
+# learning rate and layer decay rate are set to 0.004 and 0.75 respectively
+optim_wrapper = dict(
+ optimizer=dict(
+ type='AdamW', lr=4e-3, weight_decay=0.05, betas=(0.9, 0.999)),
+ constructor='LearningRateDecayOptimWrapperConstructor',
+ paramwise_cfg=dict(
+ layer_decay_rate=0.75,
+ custom_keys={
+ '.ln': dict(decay_mult=0.0),
+ '.bias': dict(decay_mult=0.0),
+ '.cls_token': dict(decay_mult=0.0),
+ '.pos_embed': dict(decay_mult=0.0)
+ }))
+
+# learning rate scheduler
+param_scheduler = [
+ dict(
+ type='LinearLR',
+ start_factor=1e-4,
+ by_epoch=True,
+ begin=0,
+ end=5,
+ convert_to_iter_based=True),
+ dict(
+ type='CosineAnnealingLR',
+ T_max=45,
+ by_epoch=True,
+ begin=5,
+ end=50,
+ eta_min=1e-6,
+ convert_to_iter_based=True)
+]
+
+# runtime settings
+train_cfg = dict(by_epoch=True, max_epochs=50)
+default_hooks = dict(
+ # save checkpoint per epoch.
+ checkpoint=dict(type='CheckpointHook', interval=1, max_keep_ckpts=3))
+
+randomness = dict(seed=0, diff_rank_seed=True)
diff --git a/configs/mae/benchmarks/vit-large-p16_8xb128-ds-coslr-50e_in1k.py b/configs/mae/benchmarks/vit-large-p16_8xb128-ds-coslr-50e_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..9aedb431c5521f725912983444523f25340eac2a
--- /dev/null
+++ b/configs/mae/benchmarks/vit-large-p16_8xb128-ds-coslr-50e_in1k.py
@@ -0,0 +1,31 @@
+_base_ = ['./vit-large-p16_8xb128-coslr-50e_in1k.py']
+
+# optimizer wrapper
+optim_wrapper = dict(type='DeepSpeedOptimWrapper')
+
+# training strategy
+strategy = dict(
+ type='DeepSpeedStrategy',
+ fp16=dict(
+ enabled=True,
+ fp16_master_weights_and_grads=False,
+ loss_scale=0,
+ loss_scale_window=500,
+ hysteresis=2,
+ min_loss_scale=1,
+ initial_scale_power=15,
+ ),
+ inputs_to_half=['inputs'],
+ zero_optimization=dict(
+ stage=1,
+ allgather_partitions=True,
+ reduce_scatter=True,
+ allgather_bucket_size=50000000,
+ reduce_bucket_size=50000000,
+ overlap_comm=True,
+ contiguous_gradients=True,
+ cpu_offload=False,
+ ))
+
+# runner which supports strategies
+runner_type = 'FlexibleRunner'
diff --git a/configs/mae/benchmarks/vit-large-p16_8xb128-fsdp-coslr-50e_in1k.py b/configs/mae/benchmarks/vit-large-p16_8xb128-fsdp-coslr-50e_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..3a8a67401eb3bb7204521d6ff97603eebc7e00c9
--- /dev/null
+++ b/configs/mae/benchmarks/vit-large-p16_8xb128-fsdp-coslr-50e_in1k.py
@@ -0,0 +1,13 @@
+_base_ = ['./vit-large-p16_8xb128-coslr-50e_in1k.py']
+
+strategy = dict(
+ type='FSDPStrategy',
+ model_wrapper=dict(
+ auto_wrap_policy=dict(
+ type='torch.distributed.fsdp.wrap.size_based_auto_wrap_policy',
+ min_num_params=1e7)))
+
+optim_wrapper = dict(type='AmpOptimWrapper')
+
+# runner which supports strategies
+runner_type = 'FlexibleRunner'
diff --git a/configs/mae/benchmarks/vit-large-p16_8xb2048-linear-coslr-90e_in1k.py b/configs/mae/benchmarks/vit-large-p16_8xb2048-linear-coslr-90e_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..c89518141c148161b2dbf082aa7b0a2eb0843539
--- /dev/null
+++ b/configs/mae/benchmarks/vit-large-p16_8xb2048-linear-coslr-90e_in1k.py
@@ -0,0 +1,64 @@
+_base_ = [
+ '../../_base_/datasets/imagenet_bs32_pil_resize.py',
+ '../../_base_/schedules/imagenet_bs1024_adamw_swin.py',
+ '../../_base_/default_runtime.py'
+]
+
+# dataset settings
+train_dataloader = dict(batch_size=2048, drop_last=True)
+val_dataloader = dict(drop_last=False)
+test_dataloader = dict(drop_last=False)
+
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='VisionTransformer',
+ arch='large',
+ img_size=224,
+ patch_size=16,
+ frozen_stages=24,
+ out_type='cls_token',
+ final_norm=True,
+ init_cfg=dict(type='Pretrained', checkpoint='', prefix='backbone.')),
+ neck=dict(type='ClsBatchNormNeck', input_features=1024),
+ head=dict(
+ type='VisionTransformerClsHead',
+ num_classes=1000,
+ in_channels=1024,
+ loss=dict(type='CrossEntropyLoss'),
+ init_cfg=[dict(type='TruncNormal', layer='Linear', std=0.01)]))
+
+# optimizer
+optim_wrapper = dict(
+ _delete_=True,
+ type='AmpOptimWrapper',
+ optimizer=dict(type='LARS', lr=6.4, weight_decay=0.0, momentum=0.9))
+
+# learning rate scheduler
+param_scheduler = [
+ dict(
+ type='LinearLR',
+ start_factor=1e-4,
+ by_epoch=True,
+ begin=0,
+ end=10,
+ convert_to_iter_based=True),
+ dict(
+ type='CosineAnnealingLR',
+ T_max=80,
+ by_epoch=True,
+ begin=10,
+ end=90,
+ eta_min=0.0,
+ convert_to_iter_based=True)
+]
+
+# runtime settings
+train_cfg = dict(by_epoch=True, max_epochs=90)
+
+default_hooks = dict(
+ checkpoint=dict(type='CheckpointHook', interval=1, max_keep_ckpts=3),
+ logger=dict(type='LoggerHook', interval=10))
+
+randomness = dict(seed=0, diff_rank_seed=True)
diff --git a/configs/mae/mae_hivit-base-p16_8xb512-amp-coslr-1600e_in1k.py b/configs/mae/mae_hivit-base-p16_8xb512-amp-coslr-1600e_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..76c0df22b7bc5ac52dd50ebdaf4b141efa20352f
--- /dev/null
+++ b/configs/mae/mae_hivit-base-p16_8xb512-amp-coslr-1600e_in1k.py
@@ -0,0 +1,56 @@
+_base_ = [
+ '../_base_/models/mae_hivit-base-p16.py',
+ '../_base_/datasets/imagenet_bs512_mae.py',
+ '../_base_/default_runtime.py',
+]
+
+# optimizer wrapper
+optim_wrapper = dict(
+ type='AmpOptimWrapper',
+ loss_scale='dynamic',
+ optimizer=dict(
+ type='AdamW',
+ lr=1.5e-4 * 4096 / 256,
+ betas=(0.9, 0.95),
+ weight_decay=0.05),
+ paramwise_cfg=dict(
+ custom_keys={
+ 'norm': dict(decay_mult=0.0),
+ 'bias': dict(decay_mult=0.0),
+ 'pos_embed': dict(decay_mult=0.),
+ 'mask_token': dict(decay_mult=0.),
+ }))
+
+# learning rate scheduler
+param_scheduler = [
+ dict(
+ type='LinearLR',
+ start_factor=1e-4,
+ by_epoch=True,
+ begin=0,
+ end=40,
+ convert_to_iter_based=True),
+ dict(
+ type='CosineAnnealingLR',
+ T_max=1560,
+ by_epoch=True,
+ begin=40,
+ end=1600,
+ convert_to_iter_based=True)
+]
+
+# runtime settings
+train_cfg = dict(type='EpochBasedTrainLoop', max_epochs=1600)
+default_hooks = dict(
+ # only keeps the latest 3 checkpoints
+ checkpoint=dict(type='CheckpointHook', interval=1, max_keep_ckpts=3))
+
+randomness = dict(seed=0, diff_rank_seed=True)
+
+# auto resume
+resume = True
+find_unused_parameters = True
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR
+# based on the actual training batch size.
+auto_scale_lr = dict(base_batch_size=4096)
diff --git a/configs/mae/mae_hivit-base-p16_8xb512-amp-coslr-400e_in1k.py b/configs/mae/mae_hivit-base-p16_8xb512-amp-coslr-400e_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..8107fccb5c5c18df90cda43cccf21cb7b86f5245
--- /dev/null
+++ b/configs/mae/mae_hivit-base-p16_8xb512-amp-coslr-400e_in1k.py
@@ -0,0 +1,56 @@
+_base_ = [
+ '../_base_/models/mae_hivit-base-p16.py',
+ '../_base_/datasets/imagenet_bs512_mae.py',
+ '../_base_/default_runtime.py',
+]
+
+# optimizer wrapper
+optim_wrapper = dict(
+ type='AmpOptimWrapper',
+ loss_scale='dynamic',
+ optimizer=dict(
+ type='AdamW',
+ lr=1.5e-4 * 4096 / 256,
+ betas=(0.9, 0.95),
+ weight_decay=0.05),
+ paramwise_cfg=dict(
+ custom_keys={
+ 'norm': dict(decay_mult=0.0),
+ 'bias': dict(decay_mult=0.0),
+ 'pos_embed': dict(decay_mult=0.),
+ 'mask_token': dict(decay_mult=0.),
+ }))
+
+# learning rate scheduler
+param_scheduler = [
+ dict(
+ type='LinearLR',
+ start_factor=1e-4,
+ by_epoch=True,
+ begin=0,
+ end=40,
+ convert_to_iter_based=True),
+ dict(
+ type='CosineAnnealingLR',
+ T_max=360,
+ by_epoch=True,
+ begin=40,
+ end=400,
+ convert_to_iter_based=True)
+]
+
+# runtime settings
+train_cfg = dict(type='EpochBasedTrainLoop', max_epochs=400)
+default_hooks = dict(
+ # only keeps the latest 3 checkpoints
+ checkpoint=dict(type='CheckpointHook', interval=1, max_keep_ckpts=3))
+
+randomness = dict(seed=0, diff_rank_seed=True)
+
+# auto resume
+resume = True
+find_unused_parameters = True
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR
+# based on the actual training batch size.
+auto_scale_lr = dict(base_batch_size=4096)
diff --git a/configs/mae/mae_hivit-base-p16_8xb512-amp-coslr-800e_in1k.py b/configs/mae/mae_hivit-base-p16_8xb512-amp-coslr-800e_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..c150e0412b2092ec7a137bd3e488cea00ef2fc7f
--- /dev/null
+++ b/configs/mae/mae_hivit-base-p16_8xb512-amp-coslr-800e_in1k.py
@@ -0,0 +1,56 @@
+_base_ = [
+ '../_base_/models/mae_hivit-base-p16.py',
+ '../_base_/datasets/imagenet_bs512_mae.py',
+ '../_base_/default_runtime.py',
+]
+
+# optimizer wrapper
+optim_wrapper = dict(
+ type='AmpOptimWrapper',
+ loss_scale='dynamic',
+ optimizer=dict(
+ type='AdamW',
+ lr=1.5e-4 * 4096 / 256,
+ betas=(0.9, 0.95),
+ weight_decay=0.05),
+ paramwise_cfg=dict(
+ custom_keys={
+ 'norm': dict(decay_mult=0.0),
+ 'bias': dict(decay_mult=0.0),
+ 'pos_embed': dict(decay_mult=0.),
+ 'mask_token': dict(decay_mult=0.),
+ }))
+
+# learning rate scheduler
+param_scheduler = [
+ dict(
+ type='LinearLR',
+ start_factor=1e-4,
+ by_epoch=True,
+ begin=0,
+ end=40,
+ convert_to_iter_based=True),
+ dict(
+ type='CosineAnnealingLR',
+ T_max=760,
+ by_epoch=True,
+ begin=40,
+ end=800,
+ convert_to_iter_based=True)
+]
+
+# runtime settings
+train_cfg = dict(type='EpochBasedTrainLoop', max_epochs=800)
+default_hooks = dict(
+ # only keeps the latest 3 checkpoints
+ checkpoint=dict(type='CheckpointHook', interval=1, max_keep_ckpts=3))
+
+randomness = dict(seed=0, diff_rank_seed=True)
+
+# auto resume
+resume = True
+find_unused_parameters = True
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR
+# based on the actual training batch size.
+auto_scale_lr = dict(base_batch_size=4096)
diff --git a/configs/mae/mae_hivit-large-p16_8xb512-amp-coslr-1600e_in1k.py b/configs/mae/mae_hivit-large-p16_8xb512-amp-coslr-1600e_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..5d5e40db5478755f751f4dd1c989d0c5906ca1d7
--- /dev/null
+++ b/configs/mae/mae_hivit-large-p16_8xb512-amp-coslr-1600e_in1k.py
@@ -0,0 +1,61 @@
+_base_ = [
+ '../_base_/models/mae_hivit-base-p16.py',
+ '../_base_/datasets/imagenet_bs512_mae.py',
+ '../_base_/default_runtime.py',
+]
+
+# model settings
+model = dict(
+ backbone=dict(type='MAEHiViT', arch='large'),
+ neck=dict(type='MAEPretrainDecoder', embed_dim=768))
+
+# optimizer wrapper
+optim_wrapper = dict(
+ type='AmpOptimWrapper',
+ loss_scale='dynamic',
+ optimizer=dict(
+ type='AdamW',
+ lr=1.5e-4 * 4096 / 256,
+ betas=(0.9, 0.95),
+ weight_decay=0.05),
+ paramwise_cfg=dict(
+ custom_keys={
+ 'norm': dict(decay_mult=0.0),
+ 'bias': dict(decay_mult=0.0),
+ 'pos_embed': dict(decay_mult=0.),
+ 'mask_token': dict(decay_mult=0.),
+ }))
+
+# learning rate scheduler
+param_scheduler = [
+ dict(
+ type='LinearLR',
+ start_factor=0.0001,
+ by_epoch=True,
+ begin=0,
+ end=40,
+ convert_to_iter_based=True),
+ dict(
+ type='CosineAnnealingLR',
+ T_max=1560,
+ by_epoch=True,
+ begin=40,
+ end=1600,
+ convert_to_iter_based=True)
+]
+
+# runtime settings
+train_cfg = dict(type='EpochBasedTrainLoop', max_epochs=1600)
+default_hooks = dict(
+ # only keeps the latest 3 checkpoints
+ checkpoint=dict(type='CheckpointHook', interval=1, max_keep_ckpts=3))
+
+randomness = dict(seed=0, diff_rank_seed=True)
+
+# auto resume
+resume = True
+find_unused_parameters = True
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR
+# based on the actual training batch size.
+auto_scale_lr = dict(base_batch_size=4096)
diff --git a/configs/mae/mae_hivit-large-p16_8xb512-amp-coslr-400e_in1k.py b/configs/mae/mae_hivit-large-p16_8xb512-amp-coslr-400e_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..2c6c47d08fdfa676dd30f628fa06c60595434f85
--- /dev/null
+++ b/configs/mae/mae_hivit-large-p16_8xb512-amp-coslr-400e_in1k.py
@@ -0,0 +1,61 @@
+_base_ = [
+ '../_base_/models/mae_hivit-base-p16.py',
+ '../_base_/datasets/imagenet_bs512_mae.py',
+ '../_base_/default_runtime.py',
+]
+
+# model settings
+model = dict(
+ backbone=dict(type='MAEHiViT', arch='large'),
+ neck=dict(type='MAEPretrainDecoder', embed_dim=768))
+
+# optimizer wrapper
+optim_wrapper = dict(
+ type='AmpOptimWrapper',
+ loss_scale='dynamic',
+ optimizer=dict(
+ type='AdamW',
+ lr=1.5e-4 * 4096 / 256,
+ betas=(0.9, 0.95),
+ weight_decay=0.05),
+ paramwise_cfg=dict(
+ custom_keys={
+ 'norm': dict(decay_mult=0.0),
+ 'bias': dict(decay_mult=0.0),
+ 'pos_embed': dict(decay_mult=0.),
+ 'mask_token': dict(decay_mult=0.),
+ }))
+
+# learning rate scheduler
+param_scheduler = [
+ dict(
+ type='LinearLR',
+ start_factor=0.0001,
+ by_epoch=True,
+ begin=0,
+ end=40,
+ convert_to_iter_based=True),
+ dict(
+ type='CosineAnnealingLR',
+ T_max=360,
+ by_epoch=True,
+ begin=40,
+ end=400,
+ convert_to_iter_based=True)
+]
+
+# runtime settings
+train_cfg = dict(type='EpochBasedTrainLoop', max_epochs=400)
+default_hooks = dict(
+ # only keeps the latest 3 checkpoints
+ checkpoint=dict(type='CheckpointHook', interval=1, max_keep_ckpts=3))
+
+randomness = dict(seed=0, diff_rank_seed=True)
+
+# auto resume
+resume = True
+find_unused_parameters = True
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR
+# based on the actual training batch size.
+auto_scale_lr = dict(base_batch_size=4096)
diff --git a/configs/mae/mae_hivit-large-p16_8xb512-amp-coslr-800e_in1k.py b/configs/mae/mae_hivit-large-p16_8xb512-amp-coslr-800e_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..4ed7d207a135264f9a1c20863fbf80d493f6f678
--- /dev/null
+++ b/configs/mae/mae_hivit-large-p16_8xb512-amp-coslr-800e_in1k.py
@@ -0,0 +1,61 @@
+_base_ = [
+ '../_base_/models/mae_hivit-base-p16.py',
+ '../_base_/datasets/imagenet_bs512_mae.py',
+ '../_base_/default_runtime.py',
+]
+
+# model settings
+model = dict(
+ backbone=dict(type='MAEHiViT', arch='large'),
+ neck=dict(type='MAEPretrainDecoder', embed_dim=768))
+
+# optimizer wrapper
+optim_wrapper = dict(
+ type='AmpOptimWrapper',
+ loss_scale='dynamic',
+ optimizer=dict(
+ type='AdamW',
+ lr=1.5e-4 * 4096 / 256,
+ betas=(0.9, 0.95),
+ weight_decay=0.05),
+ paramwise_cfg=dict(
+ custom_keys={
+ 'norm': dict(decay_mult=0.0),
+ 'bias': dict(decay_mult=0.0),
+ 'pos_embed': dict(decay_mult=0.),
+ 'mask_token': dict(decay_mult=0.),
+ }))
+
+# learning rate scheduler
+param_scheduler = [
+ dict(
+ type='LinearLR',
+ start_factor=0.0001,
+ by_epoch=True,
+ begin=0,
+ end=40,
+ convert_to_iter_based=True),
+ dict(
+ type='CosineAnnealingLR',
+ T_max=760,
+ by_epoch=True,
+ begin=40,
+ end=800,
+ convert_to_iter_based=True)
+]
+
+# runtime settings
+train_cfg = dict(type='EpochBasedTrainLoop', max_epochs=800)
+default_hooks = dict(
+ # only keeps the latest 3 checkpoints
+ checkpoint=dict(type='CheckpointHook', interval=1, max_keep_ckpts=3))
+
+randomness = dict(seed=0, diff_rank_seed=True)
+
+# auto resume
+resume = True
+find_unused_parameters = True
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR
+# based on the actual training batch size.
+auto_scale_lr = dict(base_batch_size=4096)
diff --git a/configs/mae/mae_vit-base-p16_8xb512-amp-coslr-1600e_in1k.py b/configs/mae/mae_vit-base-p16_8xb512-amp-coslr-1600e_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..bbad841818f0a96ab233b96820446c7b0d72de4a
--- /dev/null
+++ b/configs/mae/mae_vit-base-p16_8xb512-amp-coslr-1600e_in1k.py
@@ -0,0 +1,56 @@
+_base_ = [
+ '../_base_/models/mae_vit-base-p16.py',
+ '../_base_/datasets/imagenet_bs512_mae.py',
+ '../_base_/default_runtime.py',
+]
+
+# optimizer wrapper
+optim_wrapper = dict(
+ type='AmpOptimWrapper',
+ loss_scale='dynamic',
+ optimizer=dict(
+ type='AdamW',
+ lr=1.5e-4 * 4096 / 256,
+ betas=(0.9, 0.95),
+ weight_decay=0.05),
+ paramwise_cfg=dict(
+ custom_keys={
+ 'ln': dict(decay_mult=0.0),
+ 'bias': dict(decay_mult=0.0),
+ 'pos_embed': dict(decay_mult=0.),
+ 'mask_token': dict(decay_mult=0.),
+ 'cls_token': dict(decay_mult=0.)
+ }))
+
+# learning rate scheduler
+param_scheduler = [
+ dict(
+ type='LinearLR',
+ start_factor=0.0001,
+ by_epoch=True,
+ begin=0,
+ end=40,
+ convert_to_iter_based=True),
+ dict(
+ type='CosineAnnealingLR',
+ T_max=1560,
+ by_epoch=True,
+ begin=40,
+ end=1600,
+ convert_to_iter_based=True)
+]
+
+# runtime settings
+train_cfg = dict(type='EpochBasedTrainLoop', max_epochs=1600)
+default_hooks = dict(
+ # only keeps the latest 3 checkpoints
+ checkpoint=dict(type='CheckpointHook', interval=1, max_keep_ckpts=3))
+
+randomness = dict(seed=0, diff_rank_seed=True)
+
+# auto resume
+resume = True
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR
+# based on the actual training batch size.
+auto_scale_lr = dict(base_batch_size=4096)
diff --git a/configs/mae/mae_vit-base-p16_8xb512-amp-coslr-300e_in1k.py b/configs/mae/mae_vit-base-p16_8xb512-amp-coslr-300e_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..f11fb2fa98c55034a7fa3397ea337044e43f3358
--- /dev/null
+++ b/configs/mae/mae_vit-base-p16_8xb512-amp-coslr-300e_in1k.py
@@ -0,0 +1,56 @@
+_base_ = [
+ '../_base_/models/mae_vit-base-p16.py',
+ '../_base_/datasets/imagenet_bs512_mae.py',
+ '../_base_/default_runtime.py',
+]
+
+# optimizer wrapper
+optim_wrapper = dict(
+ type='AmpOptimWrapper',
+ loss_scale='dynamic',
+ optimizer=dict(
+ type='AdamW',
+ lr=1.5e-4 * 4096 / 256,
+ betas=(0.9, 0.95),
+ weight_decay=0.05),
+ paramwise_cfg=dict(
+ custom_keys={
+ 'ln': dict(decay_mult=0.0),
+ 'bias': dict(decay_mult=0.0),
+ 'pos_embed': dict(decay_mult=0.),
+ 'mask_token': dict(decay_mult=0.),
+ 'cls_token': dict(decay_mult=0.)
+ }))
+
+# learning rate scheduler
+param_scheduler = [
+ dict(
+ type='LinearLR',
+ start_factor=0.0001,
+ by_epoch=True,
+ begin=0,
+ end=40,
+ convert_to_iter_based=True),
+ dict(
+ type='CosineAnnealingLR',
+ T_max=260,
+ by_epoch=True,
+ begin=40,
+ end=300,
+ convert_to_iter_based=True)
+]
+
+# runtime settings
+train_cfg = dict(type='EpochBasedTrainLoop', max_epochs=300)
+default_hooks = dict(
+ # only keeps the latest 3 checkpoints
+ checkpoint=dict(type='CheckpointHook', interval=1, max_keep_ckpts=3))
+
+randomness = dict(seed=0, diff_rank_seed=True)
+
+# auto resume
+resume = True
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR
+# based on the actual training batch size.
+auto_scale_lr = dict(base_batch_size=4096)
diff --git a/configs/mae/mae_vit-base-p16_8xb512-amp-coslr-400e_in1k.py b/configs/mae/mae_vit-base-p16_8xb512-amp-coslr-400e_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..d8f0398356cc8c1302d9739d73b88bec0bab3b92
--- /dev/null
+++ b/configs/mae/mae_vit-base-p16_8xb512-amp-coslr-400e_in1k.py
@@ -0,0 +1,56 @@
+_base_ = [
+ '../_base_/models/mae_vit-base-p16.py',
+ '../_base_/datasets/imagenet_bs512_mae.py',
+ '../_base_/default_runtime.py',
+]
+
+# optimizer wrapper
+optim_wrapper = dict(
+ type='AmpOptimWrapper',
+ loss_scale='dynamic',
+ optimizer=dict(
+ type='AdamW',
+ lr=1.5e-4 * 4096 / 256,
+ betas=(0.9, 0.95),
+ weight_decay=0.05),
+ paramwise_cfg=dict(
+ custom_keys={
+ 'ln': dict(decay_mult=0.0),
+ 'bias': dict(decay_mult=0.0),
+ 'pos_embed': dict(decay_mult=0.),
+ 'mask_token': dict(decay_mult=0.),
+ 'cls_token': dict(decay_mult=0.)
+ }))
+
+# learning rate scheduler
+param_scheduler = [
+ dict(
+ type='LinearLR',
+ start_factor=1e-4,
+ by_epoch=True,
+ begin=0,
+ end=40,
+ convert_to_iter_based=True),
+ dict(
+ type='CosineAnnealingLR',
+ T_max=360,
+ by_epoch=True,
+ begin=40,
+ end=400,
+ convert_to_iter_based=True)
+]
+
+# runtime settings
+train_cfg = dict(type='EpochBasedTrainLoop', max_epochs=400)
+default_hooks = dict(
+ # only keeps the latest 3 checkpoints
+ checkpoint=dict(type='CheckpointHook', interval=1, max_keep_ckpts=3))
+
+randomness = dict(seed=0, diff_rank_seed=True)
+
+# auto resume
+resume = True
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR
+# based on the actual training batch size.
+auto_scale_lr = dict(base_batch_size=4096)
diff --git a/configs/mae/mae_vit-base-p16_8xb512-amp-coslr-800e_in1k.py b/configs/mae/mae_vit-base-p16_8xb512-amp-coslr-800e_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..01e0fb423969642174ac38d19a57e0db5c6cfc61
--- /dev/null
+++ b/configs/mae/mae_vit-base-p16_8xb512-amp-coslr-800e_in1k.py
@@ -0,0 +1,56 @@
+_base_ = [
+ '../_base_/models/mae_vit-base-p16.py',
+ '../_base_/datasets/imagenet_bs512_mae.py',
+ '../_base_/default_runtime.py',
+]
+
+# optimizer wrapper
+optim_wrapper = dict(
+ type='AmpOptimWrapper',
+ loss_scale='dynamic',
+ optimizer=dict(
+ type='AdamW',
+ lr=1.5e-4 * 4096 / 256,
+ betas=(0.9, 0.95),
+ weight_decay=0.05),
+ paramwise_cfg=dict(
+ custom_keys={
+ 'ln': dict(decay_mult=0.0),
+ 'bias': dict(decay_mult=0.0),
+ 'pos_embed': dict(decay_mult=0.),
+ 'mask_token': dict(decay_mult=0.),
+ 'cls_token': dict(decay_mult=0.)
+ }))
+
+# learning rate scheduler
+param_scheduler = [
+ dict(
+ type='LinearLR',
+        start_factor=1e-9,
+ by_epoch=True,
+ begin=0,
+ end=40,
+ convert_to_iter_based=True),
+ dict(
+ type='CosineAnnealingLR',
+ T_max=760,
+ by_epoch=True,
+ begin=40,
+ end=800,
+ convert_to_iter_based=True)
+]
+
+# runtime settings
+train_cfg = dict(type='EpochBasedTrainLoop', max_epochs=800)
+default_hooks = dict(
+ # only keeps the latest 3 checkpoints
+ checkpoint=dict(type='CheckpointHook', interval=1, max_keep_ckpts=3))
+
+randomness = dict(seed=0, diff_rank_seed=True)
+
+# auto resume
+resume = True
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR
+# based on the actual training batch size.
+auto_scale_lr = dict(base_batch_size=4096)
diff --git a/configs/mae/mae_vit-huge-p14_8xb512-amp-coslr-1600e_in1k.py b/configs/mae/mae_vit-huge-p14_8xb512-amp-coslr-1600e_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..5eb7a427eb0a7cfcf2da5cbc85aa1ca89d82d152
--- /dev/null
+++ b/configs/mae/mae_vit-huge-p14_8xb512-amp-coslr-1600e_in1k.py
@@ -0,0 +1,66 @@
+_base_ = [
+ '../_base_/models/mae_vit-base-p16.py',
+ '../_base_/datasets/imagenet_bs512_mae.py',
+ '../_base_/default_runtime.py',
+]
+
+# model settings
+model = dict(
+ backbone=dict(type='MAEViT', arch='h', patch_size=14),
+ neck=dict(
+ type='MAEPretrainDecoder',
+ embed_dim=1280,
+ patch_size=14,
+ num_patches=256),
+ head=dict(patch_size=14))
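+# With patch_size=14 on 224x224 inputs the encoder sees 224/14 = 16 patches per
+# side, i.e. 16 x 16 = 256 tokens, hence num_patches=256 in the decoder above.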
+
+# optimizer wrapper
+optim_wrapper = dict(
+ type='AmpOptimWrapper',
+ loss_scale='dynamic',
+ optimizer=dict(
+ type='AdamW',
+ lr=1.5e-4 * 4096 / 256,
+ betas=(0.9, 0.95),
+ weight_decay=0.05),
+ paramwise_cfg=dict(
+ custom_keys={
+ 'ln': dict(decay_mult=0.0),
+ 'bias': dict(decay_mult=0.0),
+ 'pos_embed': dict(decay_mult=0.),
+ 'mask_token': dict(decay_mult=0.),
+ 'cls_token': dict(decay_mult=0.)
+ }))
+
+# learning rate scheduler
+param_scheduler = [
+ dict(
+ type='LinearLR',
+ start_factor=1e-4,
+ by_epoch=True,
+ begin=0,
+ end=40,
+ convert_to_iter_based=True),
+ dict(
+ type='CosineAnnealingLR',
+ T_max=1560,
+ by_epoch=True,
+ begin=40,
+ end=1600,
+ convert_to_iter_based=True)
+]
+
+# runtime settings
+train_cfg = dict(type='EpochBasedTrainLoop', max_epochs=1600)
+default_hooks = dict(
+ # only keeps the latest 3 checkpoints
+ checkpoint=dict(type='CheckpointHook', interval=1, max_keep_ckpts=3))
+
+randomness = dict(seed=0, diff_rank_seed=True)
+
+# auto resume
+resume = True
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR
+# based on the actual training batch size.
+auto_scale_lr = dict(base_batch_size=4096)
diff --git a/configs/mae/mae_vit-large-p16_8xb512-amp-coslr-1600e_in1k.py b/configs/mae/mae_vit-large-p16_8xb512-amp-coslr-1600e_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..683790c0c9a80c532e0865627f48e313b3fc6595
--- /dev/null
+++ b/configs/mae/mae_vit-large-p16_8xb512-amp-coslr-1600e_in1k.py
@@ -0,0 +1,61 @@
+_base_ = [
+ '../_base_/models/mae_vit-base-p16.py',
+ '../_base_/datasets/imagenet_bs512_mae.py',
+ '../_base_/default_runtime.py',
+]
+
+# model settings
+model = dict(
+ backbone=dict(type='MAEViT', arch='l'),
+ neck=dict(type='MAEPretrainDecoder', embed_dim=1024))
+
+# optimizer wrapper
+optim_wrapper = dict(
+ type='AmpOptimWrapper',
+ loss_scale='dynamic',
+ optimizer=dict(
+ type='AdamW',
+ lr=1.5e-4 * 4096 / 256,
+ betas=(0.9, 0.95),
+ weight_decay=0.05),
+ paramwise_cfg=dict(
+ custom_keys={
+ 'ln': dict(decay_mult=0.0),
+ 'bias': dict(decay_mult=0.0),
+ 'pos_embed': dict(decay_mult=0.),
+ 'mask_token': dict(decay_mult=0.),
+ 'cls_token': dict(decay_mult=0.)
+ }))
+
+# learning rate scheduler
+param_scheduler = [
+ dict(
+ type='LinearLR',
+ start_factor=0.0001,
+ by_epoch=True,
+ begin=0,
+ end=40,
+ convert_to_iter_based=True),
+ dict(
+ type='CosineAnnealingLR',
+ T_max=1560,
+ by_epoch=True,
+ begin=40,
+ end=1600,
+ convert_to_iter_based=True)
+]
+
+# runtime settings
+train_cfg = dict(type='EpochBasedTrainLoop', max_epochs=1600)
+default_hooks = dict(
+ # only keeps the latest 3 checkpoints
+ checkpoint=dict(type='CheckpointHook', interval=1, max_keep_ckpts=3))
+
+randomness = dict(seed=0, diff_rank_seed=True)
+
+# auto resume
+resume = True
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR
+# based on the actual training batch size.
+auto_scale_lr = dict(base_batch_size=4096)
diff --git a/configs/mae/mae_vit-large-p16_8xb512-amp-coslr-300e_in1k.py b/configs/mae/mae_vit-large-p16_8xb512-amp-coslr-300e_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..539207466d25617946b2dde38612587da2b6f30e
--- /dev/null
+++ b/configs/mae/mae_vit-large-p16_8xb512-amp-coslr-300e_in1k.py
@@ -0,0 +1,61 @@
+_base_ = [
+ '../_base_/models/mae_vit-base-p16.py',
+ '../_base_/datasets/imagenet_bs512_mae.py',
+ '../_base_/default_runtime.py',
+]
+
+# model settings
+model = dict(
+ backbone=dict(type='MAEViT', arch='l'),
+ neck=dict(type='MAEPretrainDecoder', embed_dim=1024))
+
+# optimizer wrapper
+optim_wrapper = dict(
+ type='AmpOptimWrapper',
+ loss_scale='dynamic',
+ optimizer=dict(
+ type='AdamW',
+ lr=1.5e-4 * 4096 / 256,
+ betas=(0.9, 0.95),
+ weight_decay=0.05),
+ paramwise_cfg=dict(
+ custom_keys={
+ 'ln': dict(decay_mult=0.0),
+ 'bias': dict(decay_mult=0.0),
+ 'pos_embed': dict(decay_mult=0.),
+ 'mask_token': dict(decay_mult=0.),
+ 'cls_token': dict(decay_mult=0.)
+ }))
+
+# learning rate scheduler
+param_scheduler = [
+ dict(
+ type='LinearLR',
+ start_factor=0.0001,
+ by_epoch=True,
+ begin=0,
+ end=40,
+ convert_to_iter_based=True),
+ dict(
+ type='CosineAnnealingLR',
+ T_max=260,
+ by_epoch=True,
+ begin=40,
+ end=300,
+ convert_to_iter_based=True)
+]
+
+# runtime settings
+train_cfg = dict(type='EpochBasedTrainLoop', max_epochs=300)
+default_hooks = dict(
+ # only keeps the latest 3 checkpoints
+ checkpoint=dict(type='CheckpointHook', interval=1, max_keep_ckpts=3))
+
+randomness = dict(seed=0, diff_rank_seed=True)
+
+# auto resume
+resume = True
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR
+# based on the actual training batch size.
+auto_scale_lr = dict(base_batch_size=4096)
diff --git a/configs/mae/mae_vit-large-p16_8xb512-amp-coslr-400e_in1k.py b/configs/mae/mae_vit-large-p16_8xb512-amp-coslr-400e_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..f050522a2209fea0feaa2a594e10900fca47f006
--- /dev/null
+++ b/configs/mae/mae_vit-large-p16_8xb512-amp-coslr-400e_in1k.py
@@ -0,0 +1,61 @@
+_base_ = [
+ '../_base_/models/mae_vit-base-p16.py',
+ '../_base_/datasets/imagenet_bs512_mae.py',
+ '../_base_/default_runtime.py',
+]
+
+# model settings
+model = dict(
+ backbone=dict(type='MAEViT', arch='l'),
+ neck=dict(type='MAEPretrainDecoder', embed_dim=1024))
+
+# optimizer wrapper
+optim_wrapper = dict(
+ type='AmpOptimWrapper',
+ loss_scale='dynamic',
+ optimizer=dict(
+ type='AdamW',
+ lr=1.5e-4 * 4096 / 256,
+ betas=(0.9, 0.95),
+ weight_decay=0.05),
+ paramwise_cfg=dict(
+ custom_keys={
+ 'ln': dict(decay_mult=0.0),
+ 'bias': dict(decay_mult=0.0),
+ 'pos_embed': dict(decay_mult=0.),
+ 'mask_token': dict(decay_mult=0.),
+ 'cls_token': dict(decay_mult=0.)
+ }))
+
+# learning rate scheduler
+param_scheduler = [
+ dict(
+ type='LinearLR',
+ start_factor=1e-4,
+ by_epoch=True,
+ begin=0,
+ end=40,
+ convert_to_iter_based=True),
+ dict(
+ type='CosineAnnealingLR',
+ T_max=360,
+ by_epoch=True,
+ begin=40,
+ end=400,
+ convert_to_iter_based=True)
+]
+
+# runtime settings
+train_cfg = dict(type='EpochBasedTrainLoop', max_epochs=400)
+default_hooks = dict(
+ # only keeps the latest 3 checkpoints
+ checkpoint=dict(type='CheckpointHook', interval=1, max_keep_ckpts=3))
+
+randomness = dict(seed=0, diff_rank_seed=True)
+
+# auto resume
+resume = True
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR
+# based on the actual training batch size.
+auto_scale_lr = dict(base_batch_size=4096)
diff --git a/configs/mae/mae_vit-large-p16_8xb512-amp-coslr-800e_in1k.py b/configs/mae/mae_vit-large-p16_8xb512-amp-coslr-800e_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..5a4294db3275a405357c08b09c07f5672faa4adc
--- /dev/null
+++ b/configs/mae/mae_vit-large-p16_8xb512-amp-coslr-800e_in1k.py
@@ -0,0 +1,61 @@
+_base_ = [
+ '../_base_/models/mae_vit-base-p16.py',
+ '../_base_/datasets/imagenet_bs512_mae.py',
+ '../_base_/default_runtime.py',
+]
+
+# model settings
+model = dict(
+ backbone=dict(type='MAEViT', arch='l'),
+ neck=dict(type='MAEPretrainDecoder', embed_dim=1024))
+
+# optimizer wrapper
+optim_wrapper = dict(
+ type='AmpOptimWrapper',
+ loss_scale='dynamic',
+ optimizer=dict(
+ type='AdamW',
+ lr=1.5e-4 * 4096 / 256,
+ betas=(0.9, 0.95),
+ weight_decay=0.05),
+ paramwise_cfg=dict(
+ custom_keys={
+ 'ln': dict(decay_mult=0.0),
+ 'bias': dict(decay_mult=0.0),
+ 'pos_embed': dict(decay_mult=0.),
+ 'mask_token': dict(decay_mult=0.),
+ 'cls_token': dict(decay_mult=0.)
+ }))
+
+# learning rate scheduler
+param_scheduler = [
+ dict(
+ type='LinearLR',
+        start_factor=1e-9,
+ by_epoch=True,
+ begin=0,
+ end=40,
+ convert_to_iter_based=True),
+ dict(
+ type='CosineAnnealingLR',
+ T_max=760,
+ by_epoch=True,
+ begin=40,
+ end=800,
+ convert_to_iter_based=True)
+]
+
+# runtime settings
+train_cfg = dict(type='EpochBasedTrainLoop', max_epochs=800)
+default_hooks = dict(
+ # only keeps the latest 3 checkpoints
+ checkpoint=dict(type='CheckpointHook', interval=1, max_keep_ckpts=3))
+
+randomness = dict(seed=0, diff_rank_seed=True)
+
+# auto resume
+resume = True
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR
+# based on the actual training batch size.
+auto_scale_lr = dict(base_batch_size=4096)
diff --git a/configs/mae/metafile.yml b/configs/mae/metafile.yml
new file mode 100644
index 0000000000000000000000000000000000000000..8192672305de26ee20d00e1a59ad3180322491ed
--- /dev/null
+++ b/configs/mae/metafile.yml
@@ -0,0 +1,367 @@
+Collections:
+ - Name: MAE
+ Metadata:
+ Training Data: ImageNet-1k
+ Training Techniques:
+ - AdamW
+ Training Resources: 8x A100-80G GPUs
+ Architecture:
+ - ViT
+ Paper:
+ Title: Masked Autoencoders Are Scalable Vision Learners
+ URL: https://arxiv.org/abs/2111.06377
+ README: configs/mae/README.md
+
+Models:
+ - Name: mae_vit-base-p16_8xb512-amp-coslr-300e_in1k
+ Metadata:
+ Epochs: 300
+ Batch Size: 4096
+ FLOPs: 17581972224
+ Parameters: 111907840
+ Training Data: ImageNet-1k
+ In Collection: MAE
+ Results: null
+ Weights: https://download.openmmlab.com/mmselfsup/1.x/mae/mae_vit-base-p16_8xb512-fp16-coslr-300e_in1k/mae_vit-base-p16_8xb512-coslr-300e-fp16_in1k_20220829-c2cf66ba.pth
+ Config: configs/mae/mae_vit-base-p16_8xb512-amp-coslr-300e_in1k.py
+ Downstream:
+ - vit-base-p16_mae-300e-pre_8xb2048-linear-coslr-90e_in1k
+ - vit-base-p16_mae-300e-pre_8xb128-coslr-100e_in1k
+ - Name: mae_vit-base-p16_8xb512-amp-coslr-400e_in1k
+ Metadata:
+ Epochs: 400
+ Batch Size: 4096
+ FLOPs: 17581972224
+ Parameters: 111907840
+ Training Data: ImageNet-1k
+ In Collection: MAE
+ Results: null
+ Weights: https://download.openmmlab.com/mmselfsup/1.x/mae/mae_vit-base-p16_8xb512-fp16-coslr-400e_in1k/mae_vit-base-p16_8xb512-coslr-400e-fp16_in1k_20220825-bc79e40b.pth
+ Config: configs/mae/mae_vit-base-p16_8xb512-amp-coslr-400e_in1k.py
+ Downstream:
+ - vit-base-p16_mae-400e-pre_8xb2048-linear-coslr-90e_in1k
+ - vit-base-p16_mae-400e-pre_8xb128-coslr-100e_in1k
+ - Name: mae_vit-base-p16_8xb512-amp-coslr-800e_in1k
+ Metadata:
+ Epochs: 800
+ Batch Size: 4096
+ FLOPs: 17581972224
+ Parameters: 111907840
+ Training Data: ImageNet-1k
+ In Collection: MAE
+ Results: null
+ Weights: https://download.openmmlab.com/mmselfsup/1.x/mae/mae_vit-base-p16_8xb512-fp16-coslr-800e_in1k/mae_vit-base-p16_8xb512-coslr-800e-fp16_in1k_20220825-5d81fbc4.pth
+ Config: configs/mae/mae_vit-base-p16_8xb512-amp-coslr-800e_in1k.py
+ Downstream:
+ - vit-base-p16_mae-800e-pre_8xb2048-linear-coslr-90e_in1k
+ - vit-base-p16_mae-800e-pre_8xb128-coslr-100e_in1k
+ - Name: mae_vit-base-p16_8xb512-amp-coslr-1600e_in1k
+ Metadata:
+ Epochs: 1600
+ Batch Size: 4096
+ FLOPs: 17581972224
+ Parameters: 111907840
+ Training Data: ImageNet-1k
+ In Collection: MAE
+ Results: null
+ Weights: https://download.openmmlab.com/mmselfsup/1.x/mae/mae_vit-base-p16_8xb512-fp16-coslr-1600e_in1k/mae_vit-base-p16_8xb512-fp16-coslr-1600e_in1k_20220825-f7569ca2.pth
+ Config: configs/mae/mae_vit-base-p16_8xb512-amp-coslr-1600e_in1k.py
+ Downstream:
+ - vit-base-p16_mae-1600e-pre_8xb2048-linear-coslr-90e_in1k
+ - vit-base-p16_mae-1600e-pre_8xb128-coslr-100e_in1k
+ - Name: mae_vit-large-p16_8xb512-amp-coslr-400e_in1k
+ Metadata:
+ Epochs: 400
+ Batch Size: 4096
+ FLOPs: 61603111936
+ Parameters: 329541888
+ Training Data: ImageNet-1k
+ In Collection: MAE
+ Results: null
+ Weights: https://download.openmmlab.com/mmselfsup/1.x/mae/mae_vit-large-p16_8xb512-fp16-coslr-400e_in1k/mae_vit-large-p16_8xb512-fp16-coslr-400e_in1k_20220825-b11d0425.pth
+ Config: configs/mae/mae_vit-large-p16_8xb512-amp-coslr-400e_in1k.py
+ Downstream:
+ - vit-large-p16_mae-400e-pre_8xb2048-linear-coslr-90e_in1k
+ - vit-large-p16_mae-400e-pre_8xb128-coslr-50e_in1k
+ - Name: mae_vit-large-p16_8xb512-amp-coslr-800e_in1k
+ Metadata:
+ Epochs: 800
+ Batch Size: 4096
+ FLOPs: 61603111936
+ Parameters: 329541888
+ Training Data: ImageNet-1k
+ In Collection: MAE
+ Results: null
+ Weights: https://download.openmmlab.com/mmselfsup/1.x/mae/mae_vit-large-p16_8xb512-fp16-coslr-800e_in1k/mae_vit-large-p16_8xb512-fp16-coslr-800e_in1k_20220825-df72726a.pth
+ Config: configs/mae/mae_vit-large-p16_8xb512-amp-coslr-800e_in1k.py
+ Downstream:
+ - vit-large-p16_mae-800e-pre_8xb2048-linear-coslr-90e_in1k
+ - vit-large-p16_mae-800e-pre_8xb128-coslr-50e_in1k
+ - Name: mae_vit-large-p16_8xb512-amp-coslr-1600e_in1k
+ Metadata:
+ Epochs: 1600
+ Batch Size: 4096
+ FLOPs: 61603111936
+ Parameters: 329541888
+ Training Data: ImageNet-1k
+ In Collection: MAE
+ Results: null
+ Weights: https://download.openmmlab.com/mmselfsup/1.x/mae/mae_vit-large-p16_8xb512-fp16-coslr-1600e_in1k/mae_vit-large-p16_8xb512-fp16-coslr-1600e_in1k_20220825-cc7e98c9.pth
+ Config: configs/mae/mae_vit-large-p16_8xb512-amp-coslr-1600e_in1k.py
+ Downstream:
+ - vit-large-p16_mae-1600e-pre_8xb2048-linear-coslr-90e_in1k
+ - vit-large-p16_mae-1600e-pre_8xb128-coslr-50e_in1k
+  - Name: mae_vit-huge-p14_8xb512-amp-coslr-1600e_in1k
+ Metadata:
+ Epochs: 1600
+ Batch Size: 4096
+ FLOPs: 167400741120
+ Parameters: 657074508
+ Training Data: ImageNet-1k
+ In Collection: MAE
+ Results: null
+ Weights: https://download.openmmlab.com/mmselfsup/1.x/mae/mae_vit-huge-p16_8xb512-fp16-coslr-1600e_in1k/mae_vit-huge-p16_8xb512-fp16-coslr-1600e_in1k_20220916-ff848775.pth
+ Config: configs/mae/mae_vit-huge-p14_8xb512-amp-coslr-1600e_in1k.py
+ Downstream:
+ - vit-huge-p14_mae-1600e-pre_8xb128-coslr-50e_in1k
+ - vit-huge-p14_mae-1600e-pre_32xb8-coslr-50e_in1k-448px
+ - Name: vit-base-p16_mae-300e-pre_8xb128-coslr-100e_in1k
+ Metadata:
+ Epochs: 100
+ Batch Size: 1024
+ FLOPs: 17581215744
+ Parameters: 86566120
+ Training Data: ImageNet-1k
+ In Collection: MAE
+ Results:
+ - Task: Image Classification
+ Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 83.1
+ Weights: null
+ Config: configs/mae/benchmarks/vit-base-p16_8xb128-coslr-100e_in1k.py
+ - Name: vit-base-p16_mae-400e-pre_8xb128-coslr-100e_in1k
+ Metadata:
+ Epochs: 100
+ Batch Size: 1024
+ FLOPs: 17581215744
+ Parameters: 86566120
+ Training Data: ImageNet-1k
+ In Collection: MAE
+ Results:
+ - Task: Image Classification
+ Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 83.3
+ Weights: null
+ Config: configs/mae/benchmarks/vit-base-p16_8xb128-coslr-100e_in1k.py
+ - Name: vit-base-p16_mae-800e-pre_8xb128-coslr-100e_in1k
+ Metadata:
+ Epochs: 100
+ Batch Size: 1024
+ FLOPs: 17581215744
+ Parameters: 86566120
+ Training Data: ImageNet-1k
+ In Collection: MAE
+ Results:
+ - Task: Image Classification
+ Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 83.3
+ Weights: null
+ Config: configs/mae/benchmarks/vit-base-p16_8xb128-coslr-100e_in1k.py
+ - Name: vit-base-p16_mae-1600e-pre_8xb128-coslr-100e_in1k
+ Metadata:
+ Epochs: 100
+ Batch Size: 1024
+ FLOPs: 17581215744
+ Parameters: 86566120
+ Training Data: ImageNet-1k
+ In Collection: MAE
+ Results:
+ - Task: Image Classification
+ Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 83.5
+ Weights: https://download.openmmlab.com/mmselfsup/1.x/mae/mae_vit-base-p16_8xb512-fp16-coslr-1600e_in1k/vit-base-p16_ft-8xb128-coslr-100e_in1k/vit-base-p16_ft-8xb128-coslr-100e_in1k_20220825-cf70aa21.pth
+ Config: configs/mae/benchmarks/vit-base-p16_8xb128-coslr-100e_in1k.py
+ - Name: vit-base-p16_mae-300e-pre_8xb2048-linear-coslr-90e_in1k
+ Metadata:
+ Epochs: 90
+ Batch Size: 16384
+ FLOPs: 17581972992
+ Parameters: 86567656
+ Training Data: ImageNet-1k
+ In Collection: MAE
+ Results:
+ - Task: Image Classification
+ Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 60.8
+ Weights: null
+ Config: configs/mae/benchmarks/vit-base-p16_8xb2048-linear-coslr-90e_in1k.py
+ - Name: vit-base-p16_mae-400e-pre_8xb2048-linear-coslr-90e_in1k
+ Metadata:
+ Epochs: 90
+ Batch Size: 16384
+ FLOPs: 17581972992
+ Parameters: 86567656
+ Training Data: ImageNet-1k
+ In Collection: MAE
+ Results:
+ - Task: Image Classification
+ Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 62.5
+ Weights: null
+ Config: configs/mae/benchmarks/vit-base-p16_8xb2048-linear-coslr-90e_in1k.py
+ - Name: vit-base-p16_mae-800e-pre_8xb2048-linear-coslr-90e_in1k
+ Metadata:
+ Epochs: 90
+ Batch Size: 16384
+ FLOPs: 17581972992
+ Parameters: 86567656
+ Training Data: ImageNet-1k
+ In Collection: MAE
+ Results:
+ - Task: Image Classification
+ Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 65.1
+ Weights: null
+ Config: configs/mae/benchmarks/vit-base-p16_8xb2048-linear-coslr-90e_in1k.py
+ - Name: vit-base-p16_mae-1600e-pre_8xb2048-linear-coslr-90e_in1k
+ Metadata:
+ Epochs: 90
+ Batch Size: 16384
+ FLOPs: 17581972992
+ Parameters: 86567656
+ Training Data: ImageNet-1k
+ In Collection: MAE
+ Results:
+ - Task: Image Classification
+ Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 67.1
+ Weights: null
+ Config: configs/mae/benchmarks/vit-base-p16_8xb2048-linear-coslr-90e_in1k.py
+ - Name: vit-large-p16_mae-400e-pre_8xb128-coslr-50e_in1k
+ Metadata:
+ Epochs: 50
+ Batch Size: 1024
+ FLOPs: 61602103296
+ Parameters: 304324584
+ Training Data: ImageNet-1k
+ In Collection: MAE
+ Results:
+ - Task: Image Classification
+ Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 85.2
+ Weights: null
+ Config: configs/mae/benchmarks/vit-large-p16_8xb128-coslr-50e_in1k.py
+ - Name: vit-large-p16_mae-800e-pre_8xb128-coslr-50e_in1k
+ Metadata:
+ Epochs: 50
+ Batch Size: 1024
+ FLOPs: 61602103296
+ Parameters: 304324584
+ Training Data: ImageNet-1k
+ In Collection: MAE
+ Results:
+ - Task: Image Classification
+ Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 85.4
+ Weights: null
+ Config: configs/mae/benchmarks/vit-large-p16_8xb128-coslr-50e_in1k.py
+ - Name: vit-large-p16_mae-1600e-pre_8xb128-coslr-50e_in1k
+ Metadata:
+ Epochs: 50
+ Batch Size: 1024
+ FLOPs: 61602103296
+ Parameters: 304324584
+ Training Data: ImageNet-1k
+ In Collection: MAE
+ Results:
+ - Task: Image Classification
+ Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 85.7
+ Weights: null
+ Config: configs/mae/benchmarks/vit-large-p16_8xb128-coslr-50e_in1k.py
+ - Name: vit-large-p16_mae-400e-pre_8xb2048-linear-coslr-90e_in1k
+ Metadata:
+ Epochs: 90
+ Batch Size: 16384
+ FLOPs: 61603112960
+ Parameters: 304326632
+ Training Data: ImageNet-1k
+ In Collection: MAE
+ Results:
+ - Task: Image Classification
+ Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 70.7
+ Weights: null
+ Config: configs/mae/benchmarks/vit-large-p16_8xb2048-linear-coslr-90e_in1k.py
+ - Name: vit-large-p16_mae-800e-pre_8xb2048-linear-coslr-90e_in1k
+ Metadata:
+ Epochs: 90
+ Batch Size: 16384
+ FLOPs: 61603112960
+ Parameters: 304326632
+ Training Data: ImageNet-1k
+ In Collection: MAE
+ Results:
+ - Task: Image Classification
+ Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 73.7
+ Weights: null
+ Config: configs/mae/benchmarks/vit-large-p16_8xb2048-linear-coslr-90e_in1k.py
+ - Name: vit-large-p16_mae-1600e-pre_8xb2048-linear-coslr-90e_in1k
+ Metadata:
+ Epochs: 90
+ Batch Size: 16384
+ FLOPs: 61603112960
+ Parameters: 304326632
+ Training Data: ImageNet-1k
+ In Collection: MAE
+ Results:
+ - Task: Image Classification
+ Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 75.5
+ Weights: null
+ Config: configs/mae/benchmarks/vit-large-p16_8xb2048-linear-coslr-90e_in1k.py
+ - Name: vit-huge-p14_mae-1600e-pre_8xb128-coslr-50e_in1k
+ Metadata:
+ Epochs: 50
+ Batch Size: 1024
+ FLOPs: 167399096320
+ Parameters: 632043240
+ Training Data: ImageNet-1k
+ In Collection: MAE
+ Results:
+ - Task: Image Classification
+ Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 86.9
+ Weights: https://download.openmmlab.com/mmselfsup/1.x/mae/mae_vit-huge-p16_8xb512-fp16-coslr-1600e_in1k/vit-huge-p16_ft-8xb128-coslr-50e_in1k/vit-huge-p16_ft-8xb128-coslr-50e_in1k_20220916-0bfc9bfd.pth
+ Config: configs/mae/benchmarks/vit-huge-p14_8xb128-coslr-50e_in1k.py
+ - Name: vit-huge-p14_mae-1600e-pre_32xb8-coslr-50e_in1k-448px
+ Metadata:
+ Epochs: 50
+ Batch Size: 256
+ FLOPs: 732131983360
+ Parameters: 633026280
+ Training Data: ImageNet-1k
+ In Collection: MAE
+ Results:
+ - Task: Image Classification
+ Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 87.3
+ Weights: https://download.openmmlab.com/mmselfsup/1.x/mae/mae_vit-huge-p16_8xb512-fp16-coslr-1600e_in1k/vit-huge-p16_ft-32xb8-coslr-50e_in1k-448/vit-huge-p16_ft-32xb8-coslr-50e_in1k-448_20220916-95b6a0ce.pth
+ Config: configs/mae/benchmarks/vit-huge-p14_32xb8-coslr-50e_in1k-448px.py
diff --git a/configs/maskfeat/README.md b/configs/maskfeat/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..d25b32bb2d45990d91185de0cb34ee7e5dd9ecc5
--- /dev/null
+++ b/configs/maskfeat/README.md
@@ -0,0 +1,85 @@
+# MaskFeat
+
+> [Masked Feature Prediction for Self-Supervised Visual Pre-Training](https://arxiv.org/abs/2112.09133v1)
+
+
+
+## Abstract
+
+We present Masked Feature Prediction (MaskFeat) for self-supervised pre-training of video models. Our approach first randomly masks out a portion of the input sequence and then predicts the feature of the masked regions. We study five different types of features and find Histograms of Oriented Gradients (HOG), a hand-crafted feature descriptor, works particularly well in terms of both performance and efficiency. We observe that the local contrast normalization in HOG is essential for good results, which is in line with earlier work using HOG for visual recognition. Our approach can learn abundant visual knowledge and drive large-scale Transformer-based models. Without using extra model weights or supervision, MaskFeat pre-trained on unlabeled videos achieves unprecedented results of 86.7% with MViT-L on Kinetics-400, 88.3% on Kinetics-600, 80.4% on Kinetics-700, 38.8 mAP on AVA, and 75.0% on SSv2. MaskFeat further generalizes to image input, which can be interpreted as a video with a single frame and obtains competitive results on ImageNet.
+
+
+
+
+
+## How to use it?
+
+
+
+**Predict image**
+
+```python
+from mmpretrain import inference_model
+
+predict = inference_model('vit-base-p16_maskfeat-pre_8xb256-coslr-100e_in1k', 'demo/bird.JPEG')
+print(predict['pred_class'])
+print(predict['pred_score'])
+```
+
+**Use the model**
+
+```python
+import torch
+from mmpretrain import get_model
+
+model = get_model('maskfeat_vit-base-p16_8xb256-amp-coslr-300e_in1k', pretrained=True)
+inputs = torch.rand(1, 3, 224, 224)
+out = model(inputs)
+print(type(out))
+# To extract features.
+feats = model.extract_feat(inputs)
+print(type(feats))
+```
+
+**Train/Test Command**
+
+Prepare your dataset according to the [docs](https://mmpretrain.readthedocs.io/en/latest/user_guides/dataset_prepare.html#prepare-dataset).
+
+Train:
+
+```shell
+python tools/train.py configs/maskfeat/maskfeat_vit-base-p16_8xb256-amp-coslr-300e_in1k.py
+```
+
+Test:
+
+```shell
+python tools/test.py configs/maskfeat/benchmarks/vit-base-p16_8xb256-coslr-100e_in1k.py https://download.openmmlab.com/mmselfsup/1.x/maskfeat/maskfeat_vit-base-p16_8xb256-amp-coslr-300e_in1k/vit-base-p16_ft-8xb256-coslr-100e_in1k/vit-base-p16_ft-8xb256-coslr-100e_in1k_20221028-5134431c.pth
+```
+
+
+
+## Models and results
+
+### Pretrained models
+
+| Model | Params (M) | Flops (G) | Config | Download |
+| :------------------------------------------------- | :--------: | :-------: | :-----------------------------------------------------------: | :--------------------------------------------------------------------: |
+| `maskfeat_vit-base-p16_8xb256-amp-coslr-300e_in1k` | 85.88 | 17.58 | [config](maskfeat_vit-base-p16_8xb256-amp-coslr-300e_in1k.py) | [model](https://download.openmmlab.com/mmselfsup/1.x/maskfeat/maskfeat_vit-base-p16_8xb256-amp-coslr-300e_in1k/maskfeat_vit-base-p16_8xb256-amp-coslr-300e_in1k_20221101-6dfc8bf3.pth) \| [log](https://download.openmmlab.com/mmselfsup/1.x/maskfeat/maskfeat_vit-base-p16_8xb256-amp-coslr-300e_in1k/maskfeat_vit-base-p16_8xb256-amp-coslr-300e_in1k_20221101-6dfc8bf3.json) |
+
+### Image Classification on ImageNet-1k
+
+| Model | Pretrain | Params (M) | Flops (G) | Top-1 (%) | Config | Download |
+| :---------------------------------------- | :------------------------------------------: | :--------: | :-------: | :-------: | :----------------------------------------: | :-------------------------------------------: |
+| `vit-base-p16_maskfeat-pre_8xb256-coslr-100e_in1k` | [MASKFEAT](https://download.openmmlab.com/mmselfsup/1.x/maskfeat/maskfeat_vit-base-p16_8xb256-amp-coslr-300e_in1k/maskfeat_vit-base-p16_8xb256-amp-coslr-300e_in1k_20221101-6dfc8bf3.pth) | 86.57 | 17.58 | 83.40 | [config](benchmarks/vit-base-p16_8xb256-coslr-100e_in1k.py) | [model](https://download.openmmlab.com/mmselfsup/1.x/maskfeat/maskfeat_vit-base-p16_8xb256-amp-coslr-300e_in1k/vit-base-p16_ft-8xb256-coslr-100e_in1k/vit-base-p16_ft-8xb256-coslr-100e_in1k_20221028-5134431c.pth) \| [log](https://download.openmmlab.com/mmselfsup/1.x/maskfeat/maskfeat_vit-base-p16_8xb256-amp-coslr-300e_in1k/vit-base-p16_ft-8xb256-coslr-100e_in1k/vit-base-p16_ft-8xb256-coslr-100e_in1k_20221028-5134431c.json) |
+
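+The fine-tuning entry above uses the benchmark config, whose `init_cfg.checkpoint`
+field is left empty and is meant to point at the released MaskFeat pre-training
+weights. As a rough sketch (the `--cfg-options` override below is an assumption
+based on the standard MMEngine train script, not an officially documented command):
+
+```shell
+python tools/train.py configs/maskfeat/benchmarks/vit-base-p16_8xb256-coslr-100e_in1k.py \
+    --cfg-options model.backbone.init_cfg.checkpoint=https://download.openmmlab.com/mmselfsup/1.x/maskfeat/maskfeat_vit-base-p16_8xb256-amp-coslr-300e_in1k/maskfeat_vit-base-p16_8xb256-amp-coslr-300e_in1k_20221101-6dfc8bf3.pth
+```
+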
+## Citation
+
+```bibtex
+@InProceedings{wei2022masked,
+ author = {Wei, Chen and Fan, Haoqi and Xie, Saining and Wu, Chao-Yuan and Yuille, Alan and Feichtenhofer, Christoph},
+ title = {Masked Feature Prediction for Self-Supervised Visual Pre-Training},
+ booktitle = {CVPR},
+ year = {2022},
+}
+```
diff --git a/configs/maskfeat/benchmarks/vit-base-p16_8xb256-coslr-100e_in1k.py b/configs/maskfeat/benchmarks/vit-base-p16_8xb256-coslr-100e_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..5a7620b46b4337adbff8aa97834d347c5da09e55
--- /dev/null
+++ b/configs/maskfeat/benchmarks/vit-base-p16_8xb256-coslr-100e_in1k.py
@@ -0,0 +1,114 @@
+_base_ = [
+ '../../_base_/datasets/imagenet_bs64_swin_224.py',
+ '../../_base_/schedules/imagenet_bs1024_adamw_swin.py',
+ '../../_base_/default_runtime.py'
+]
+
+# dataset
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='RandomResizedCrop',
+ scale=224,
+ backend='pillow',
+ interpolation='bicubic'),
+ dict(type='RandomFlip', prob=0.5, direction='horizontal'),
+ dict(
+ type='RandAugment',
+ policies='timm_increasing',
+ num_policies=2,
+ total_level=10,
+ magnitude_level=9,
+ magnitude_std=0.5,
+ hparams=dict(pad_val=[104, 116, 124], interpolation='bicubic')),
+ dict(type='ColorJitter', brightness=0.4, contrast=0.4, saturation=0.4),
+ dict(
+ type='RandomErasing',
+ erase_prob=0.25,
+ mode='rand',
+ min_area_ratio=0.02,
+ max_area_ratio=0.3333333333333333,
+ fill_color=[103.53, 116.28, 123.675],
+ fill_std=[57.375, 57.12, 58.395]),
+ dict(type='PackInputs'),
+]
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='ResizeEdge', scale=256, edge='short', backend='pillow'),
+ dict(type='CenterCrop', crop_size=224),
+ dict(type='PackInputs'),
+]
+
+train_dataloader = dict(batch_size=256, dataset=dict(pipeline=train_pipeline))
+val_dataloader = dict(batch_size=256, dataset=dict(pipeline=test_pipeline))
+
+# If you want standard test, please manually configure the test dataset
+test_dataloader = val_dataloader
+
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='VisionTransformer',
+ arch='base',
+ img_size=224,
+ patch_size=16,
+ drop_path_rate=0.1,
+ out_type='avg_featmap',
+ final_norm=False,
+ init_cfg=dict(type='Pretrained', checkpoint='', prefix='backbone.')),
+ neck=None,
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=768,
+ loss=dict(
+ type='LabelSmoothLoss', label_smooth_val=0.1, mode='original'),
+ init_cfg=[
+ dict(type='TruncNormal', layer='Linear', std=2e-5, bias=2e-5)
+ ]),
+ train_cfg=dict(augments=[
+ dict(type='Mixup', alpha=0.8),
+ dict(type='CutMix', alpha=1.0)
+ ]))
+
+# optimizer wrapper
+optim_wrapper = dict(
+ optimizer=dict(
+ type='AdamW', lr=8e-3, weight_decay=0.05, betas=(0.9, 0.999)),
+ constructor='LearningRateDecayOptimWrapperConstructor',
+ paramwise_cfg=dict(
+ layer_decay_rate=0.65,
+ custom_keys={
+ '.ln': dict(decay_mult=0.0),
+ '.bias': dict(decay_mult=0.0),
+ '.cls_token': dict(decay_mult=0.0),
+ '.pos_embed': dict(decay_mult=0.0)
+ }))
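+# The LearningRateDecayOptimWrapperConstructor above applies layer-wise lr decay:
+# parameters near the head keep the base lr (8e-3) while each earlier transformer
+# block is scaled by a further factor of layer_decay_rate=0.65, a common recipe
+# for fine-tuning MIM-pretrained ViTs.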
+
+# learning rate scheduler
+param_scheduler = [
+ dict(
+ type='LinearLR',
+ start_factor=1e-4,
+ by_epoch=True,
+ begin=0,
+ end=20,
+ convert_to_iter_based=True),
+ dict(
+ type='CosineAnnealingLR',
+ T_max=80,
+ by_epoch=True,
+ begin=20,
+ end=100,
+ eta_min=1e-6,
+ convert_to_iter_based=True)
+]
+
+# runtime settings
+train_cfg = dict(type='EpochBasedTrainLoop', max_epochs=100)
+default_hooks = dict(
+ # save checkpoint per epoch.
+ checkpoint=dict(type='CheckpointHook', interval=1, max_keep_ckpts=3))
+
+randomness = dict(seed=0)
diff --git a/configs/maskfeat/maskfeat_vit-base-p16_8xb256-amp-coslr-300e_in1k.py b/configs/maskfeat/maskfeat_vit-base-p16_8xb256-amp-coslr-300e_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..465ff5c36465080be4ad50e6b1511b728c3318f1
--- /dev/null
+++ b/configs/maskfeat/maskfeat_vit-base-p16_8xb256-amp-coslr-300e_in1k.py
@@ -0,0 +1,103 @@
+_base_ = '../_base_/default_runtime.py'
+
+# dataset settings
+dataset_type = 'ImageNet'
+data_root = 'data/imagenet/'
+data_preprocessor = dict(
+ type='SelfSupDataPreprocessor',
+ mean=[123.675, 116.28, 103.53],
+ std=[58.395, 57.12, 57.375],
+ to_rgb=True)
+
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='RandomResizedCrop',
+ scale=224,
+ crop_ratio_range=(0.5, 1.0),
+ interpolation='bicubic'),
+ dict(type='RandomFlip', prob=0.5, direction='horizontal'),
+ dict(
+ type='BEiTMaskGenerator',
+ input_size=14,
+ num_masking_patches=78,
+ min_num_patches=15,
+ ),
+ dict(type='PackInputs')
+]
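+# The BEiTMaskGenerator above masks about 78 of the 14 x 14 = 196 patch
+# positions per image, i.e. roughly a 40% mask ratio.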
+
+train_dataloader = dict(
+ batch_size=256,
+ num_workers=8,
+ persistent_workers=True,
+ pin_memory=True,
+ sampler=dict(type='DefaultSampler', shuffle=True),
+ collate_fn=dict(type='default_collate'),
+ dataset=dict(
+ type=dataset_type,
+ data_root=data_root,
+ ann_file='meta/train.txt',
+ data_prefix=dict(img_path='train/'),
+ pipeline=train_pipeline))
+
+# model settings
+model = dict(
+ type='MaskFeat',
+ backbone=dict(type='MaskFeatViT', arch='b', patch_size=16),
+ neck=dict(
+ type='LinearNeck',
+ in_channels=768,
+ out_channels=108,
+ norm_cfg=None,
+ init_cfg=dict(type='TruncNormal', layer='Linear', std=0.02, bias=0)),
+ head=dict(
+ type='MIMHead',
+ loss=dict(type='PixelReconstructionLoss', criterion='L2')),
+ target_generator=dict(
+ type='HOGGenerator', nbins=9, pool=8, gaussian_window=16))
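+# With nbins=9, pool=8 and 16x16 patches, each patch yields 2x2 cells x 3 colour
+# channels x 9 orientation bins = 108 HOG values, matching out_channels=108 in
+# the LinearNeck above.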
+
+# optimizer wrapper
+optim_wrapper = dict(
+ type='AmpOptimWrapper',
+ loss_scale='dynamic',
+ optimizer=dict(
+ type='AdamW', lr=2e-4 * 8, betas=(0.9, 0.999), weight_decay=0.05),
+ clip_grad=dict(max_norm=0.02),
+ paramwise_cfg=dict(
+ bias_decay_mult=0.0,
+ norm_decay_mult=0.0,
+ flat_decay_mult=0.0,
+ custom_keys={
+ # 'pos_embed': dict(decay_mult=0.),
+ # 'cls_token': dict(decay_mult=0.),
+ 'mask_token': dict(decay_mult=0.)
+ }))
+
+# learning rate scheduler
+param_scheduler = [
+ dict(
+ type='LinearLR',
+ start_factor=1e-6,
+ by_epoch=True,
+ begin=0,
+ end=30,
+ convert_to_iter_based=True),
+ dict(
+ type='CosineAnnealingLR',
+ T_max=270,
+ eta_min=1e-6,
+ by_epoch=True,
+ begin=30,
+ end=300,
+ convert_to_iter_based=True)
+]
+
+# runtime settings
+train_cfg = dict(type='EpochBasedTrainLoop', max_epochs=300)
+default_hooks = dict(
+ # only keeps the latest 3 checkpoints
+ checkpoint=dict(type='CheckpointHook', interval=1, max_keep_ckpts=3))
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR
+# based on the actual training batch size.
+auto_scale_lr = dict(base_batch_size=2048)
diff --git a/configs/maskfeat/metafile.yml b/configs/maskfeat/metafile.yml
new file mode 100644
index 0000000000000000000000000000000000000000..1e1e1b4ae263077d2f88bc40aa893a57e3bba14a
--- /dev/null
+++ b/configs/maskfeat/metafile.yml
@@ -0,0 +1,43 @@
+Collections:
+ - Name: MaskFeat
+ Metadata:
+ Training Data: ImageNet-1k
+ Training Techniques:
+ - AdamW
+ Training Resources: 8x A100-80G GPUs
+ Architecture:
+ - ViT
+ Paper:
+ Title: Masked Feature Prediction for Self-Supervised Visual Pre-Training
+ URL: https://arxiv.org/abs/2112.09133v1
+ README: configs/maskfeat/README.md
+
+Models:
+ - Name: maskfeat_vit-base-p16_8xb256-amp-coslr-300e_in1k
+ Metadata:
+ Epochs: 300
+ Batch Size: 2048
+ FLOPs: 17581972224
+ Parameters: 85882692
+ Training Data: ImageNet-1k
+ In Collection: MaskFeat
+ Results: null
+ Weights: https://download.openmmlab.com/mmselfsup/1.x/maskfeat/maskfeat_vit-base-p16_8xb256-amp-coslr-300e_in1k/maskfeat_vit-base-p16_8xb256-amp-coslr-300e_in1k_20221101-6dfc8bf3.pth
+ Config: configs/maskfeat/maskfeat_vit-base-p16_8xb256-amp-coslr-300e_in1k.py
+ Downstream:
+ - vit-base-p16_maskfeat-pre_8xb256-coslr-100e_in1k
+ - Name: vit-base-p16_maskfeat-pre_8xb256-coslr-100e_in1k
+ Metadata:
+ Epochs: 100
+ Batch Size: 2048
+ FLOPs: 17581215744
+ Parameters: 86566120
+ Training Data: ImageNet-1k
+ In Collection: MaskFeat
+ Results:
+ - Task: Image Classification
+ Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 83.4
+ Weights: https://download.openmmlab.com/mmselfsup/1.x/maskfeat/maskfeat_vit-base-p16_8xb256-amp-coslr-300e_in1k/vit-base-p16_ft-8xb256-coslr-100e_in1k/vit-base-p16_ft-8xb256-coslr-100e_in1k_20221028-5134431c.pth
+ Config: configs/maskfeat/benchmarks/vit-base-p16_8xb256-coslr-100e_in1k.py
diff --git a/configs/mff/README.md b/configs/mff/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..7001c74be203f5997275372e57e9de4952a8f9f3
--- /dev/null
+++ b/configs/mff/README.md
@@ -0,0 +1,60 @@
+# MFF
+
+> [Improving Pixel-based MIM by Reducing Wasted Modeling Capability](https://arxiv.org/abs/2308.00261)
+
+
+
+## Abstract
+
+There has been significant progress in Masked Image Modeling (MIM). Existing MIM methods can be broadly categorized into two groups based on the reconstruction target: pixel-based and tokenizer-based approaches. The former offers a simpler pipeline and lower computational cost, but it is known to be biased toward high-frequency details. In this paper, we provide a set of empirical studies to confirm this limitation of pixel-based MIM and propose a new method that explicitly utilizes low-level features from shallow layers to aid pixel reconstruction. By incorporating this design into our base method, MAE, we reduce the wasted modeling capability of pixel-based MIM, improving its convergence and achieving non-trivial improvements across various downstream tasks. To the best of our knowledge, we are the first to systematically investigate multi-level feature fusion for isotropic architectures like the standard Vision Transformer (ViT). Notably, when applied to a smaller model (e.g., ViT-S), our method yields significant performance gains, such as 1.2% on fine-tuning, 2.8% on linear probing, and 2.6% on semantic segmentation.
+
+
+
+
+
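+## How to use it?
+
+The snippets below are a minimal usage sketch following the same `mmpretrain`
+API as the other pre-training READMEs in this repository; the model names are
+the ones registered in `configs/mff/metafile.yml` and are assumed to be
+downloadable in the same way as the other checkpoints.
+
+**Predict image**
+
+```python
+from mmpretrain import inference_model
+
+# 'vit-base-p16_mff-300e-pre_8xb128-coslr-100e_in1k' is the fine-tuned
+# classifier registered in the MFF metafile.
+predict = inference_model('vit-base-p16_mff-300e-pre_8xb128-coslr-100e_in1k', 'demo/bird.JPEG')
+print(predict['pred_class'])
+print(predict['pred_score'])
+```
+
+**Use the model**
+
+```python
+import torch
+from mmpretrain import get_model
+
+# Load the MFF pre-trained model registered in the metafile.
+model = get_model('mff_vit-base-p16_8xb512-amp-coslr-300e_in1k', pretrained=True)
+inputs = torch.rand(1, 3, 224, 224)
+out = model(inputs)
+print(type(out))
+# To extract features.
+feats = model.extract_feat(inputs)
+print(type(feats))
+```
+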
+**Train/Test Command**
+
+Prepare your dataset according to the [docs](https://mmpretrain.readthedocs.io/en/latest/user_guides/dataset_prepare.html#prepare-dataset).
+
+Train:
+
+```shell
+python tools/train.py configs/mff/mff_vit-base-p16_8xb512-amp-coslr-300e_in1k.py
+```
+
+Test:
+
+```shell
+python tools/test.py configs/mff/benchmarks/vit-base-p16_8xb128-coslr-100e_in1k.py https://download.openmmlab.com/mmpretrain/v1.0/mff/mff_vit-base-p16_8xb512-amp-coslr-300e_in1k/vit-base-p16_8xb128-coslr-100e_in1k/vit-base-p16_8xb128-coslr-100e_in1k_20230802-d746fdb7.pth
+```
+
+
+
+## Models and results
+
+### Pretrained models
+
+| Model | Params (M) | Flops (G) | Config | Download |
+| :-------------------------------------------- | :--------: | :-------: | :------------------------------------------------------: | :------------------------------------------------------------------------------: |
+| `mff_vit-base-p16_8xb512-amp-coslr-300e_in1k` | - | - | [config](mff_vit-base-p16_8xb512-amp-coslr-300e_in1k.py) | [model](https://download.openmmlab.com/mmpretrain/v1.0/mff/mff_vit-base-p16_8xb512-amp-coslr-300e_in1k/mff_vit-base-p16_8xb512-amp-coslr-300e_in1k_20230801-3c1bcce4.pth) \| [log](https://download.openmmlab.com/mmpretrain/v1.0/mff/mff_vit-base-p16_8xb512-amp-coslr-300e_in1k/mff_vit-base-p16_8xb512-amp-coslr-300e_in1k_20230801-3c1bcce4.json) |
+| `mff_vit-base-p16_8xb512-amp-coslr-800e_in1k` | - | - | [config](mff_vit-base-p16_8xb512-amp-coslr-800e_in1k.py) | [model](https://download.openmmlab.com/mmpretrain/v1.0/mff/mff_vit-base-p16_8xb512-amp-coslr-800e_in1k/mff_vit-base-p16_8xb512-amp-coslr-800e_in1k_20230801-3af7cd9d.pth) \| [log](https://download.openmmlab.com/mmpretrain/v1.0/mff/mff_vit-base-p16_8xb512-amp-coslr-800e_in1k/mff_vit-base-p16_8xb512-amp-coslr-800e_in1k_20230801-3af7cd9d.json) |
+
+### Image Classification on ImageNet-1k
+
+| Model | Pretrain | Params (M) | Flops (G) | Top-1 (%) | Config | Download |
+| :---------------------------------------- | :------------------------------------------: | :--------: | :-------: | :-------: | :----------------------------------------: | :-------------------------------------------: |
+| `vit-base-p16_mff-300e-pre_8xb128-coslr-100e_in1k` | [MFF 300-Epochs](https://download.openmmlab.com/mmpretrain/v1.0/mff/mff_vit-base-p16_8xb512-amp-coslr-300e_in1k/mff_vit-base-p16_8xb512-amp-coslr-300e_in1k_20230801-3c1bcce4.pth) | 86.57 | 17.58 | 83.00 | [config](benchmarks/vit-base-p16_8xb128-coslr-100e_in1k.py) | [model](https://download.openmmlab.com/mmpretrain/v1.0/mff/mff_vit-base-p16_8xb512-amp-coslr-300e_in1k/vit-base-p16_8xb128-coslr-100e_in1k/vit-base-p16_8xb128-coslr-100e_in1k_20230802-d746fdb7.pth) / [log](https://download.openmmlab.com/mmpretrain/v1.0/mff/mff_vit-base-p16_8xb512-amp-coslr-300e_in1k/vit-base-p16_8xb128-coslr-100e_in1k/vit-base-p16_8xb128-coslr-100e_in1k_20230802-d746fdb7.json) |
+| `vit-base-p16_mff-800e-pre_8xb128-coslr-100e_in1k` | [MFF 800-Epochs](https://download.openmmlab.com/mmpretrain/v1.0/mff/mff_vit-base-p16_8xb512-amp-coslr-800e_in1k/mff_vit-base-p16_8xb512-amp-coslr-800e_in1k_20230801-3af7cd9d.pth) | 86.57 | 17.58 | 83.70 | [config](benchmarks/vit-base-p16_8xb128-coslr-100e_in1k.py) | [model](https://download.openmmlab.com/mmpretrain/v1.0/mff/mff_vit-base-p16_8xb512-amp-coslr-800e_in1k/vit-base-p16_8xb128-coslr-100e/vit-base-p16_8xb128-coslr-100e_20230802-6780e47d.pth) / [log](https://download.openmmlab.com/mmpretrain/v1.0/mff/mff_vit-base-p16_8xb512-amp-coslr-800e_in1k/vit-base-p16_8xb128-coslr-100e/vit-base-p16_8xb128-coslr-100e_20230802-6780e47d.json) |
+| `vit-base-p16_mff-300e-pre_8xb2048-linear-coslr-90e_in1k` | [MFF 300-Epochs](https://download.openmmlab.com/mmpretrain/v1.0/mff/mff_vit-base-p16_8xb512-amp-coslr-300e_in1k/mff_vit-base-p16_8xb512-amp-coslr-300e_in1k_20230801-3c1bcce4.pth) | 86.57 | 17.58 | 64.20 | [config](benchmarks/vit-base-p16_8xb2048-linear-coslr-90e_in1k.py) | [log](https://download.openmmlab.com/mmpretrain/v1.0/mff/mff_vit-base-p16_8xb512-amp-coslr-300e_in1k/vit-base-p16_8xb2048-linear-coslr-90e_in1k/vit-base-p16_8xb2048-linear-coslr-90e_in1k.json) |
+| `vit-base-p16_mff-800e-pre_8xb2048-linear-coslr-90e_in1k` | [MFF 800-Epochs](https://download.openmmlab.com/mmpretrain/v1.0/mff/mff_vit-base-p16_8xb512-amp-coslr-800e_in1k/mff_vit-base-p16_8xb512-amp-coslr-800e_in1k_20230801-3af7cd9d.pth) | 86.57 | 17.58 | 68.30 | [config](benchmarks/vit-base-p16_8xb2048-linear-coslr-90e_in1k.py) | [model](https://download.openmmlab.com/mmpretrain/v1.0/mff/mff_vit-base-p16_8xb512-amp-coslr-800e_in1k/vit-base-p16_8xb2048-linear-coslr-90e/vit-base-p16_8xb2048-linear-coslr-90e_20230802-6b1f7bc8.pth) / [log](https://download.openmmlab.com/mmpretrain/v1.0/mff/mff_vit-base-p16_8xb512-amp-coslr-800e_in1k/vit-base-p16_8xb2048-linear-coslr-90e/vit-base-p16_8xb2048-linear-coslr-90e_20230802-6b1f7bc8.json) |
+
+## Citation
+
+```bibtex
+@article{MFF,
+ title={Improving Pixel-based MIM by Reducing Wasted Modeling Capability},
+  author={Yuan Liu and Songyang Zhang and Jiacheng Chen and Zhaohui Yu and Kai Chen and Dahua Lin},
+  journal={arXiv preprint arXiv:2308.00261},
+ year={2023}
+}
+```
diff --git a/configs/mff/benchmarks/vit-base-p16_8xb128-coslr-100e_in1k.py b/configs/mff/benchmarks/vit-base-p16_8xb128-coslr-100e_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..4cf9ca1134766cd3b0179b7581511cd94dedbbc2
--- /dev/null
+++ b/configs/mff/benchmarks/vit-base-p16_8xb128-coslr-100e_in1k.py
@@ -0,0 +1,114 @@
+_base_ = [
+ '../../_base_/datasets/imagenet_bs64_swin_224.py',
+ '../../_base_/schedules/imagenet_bs1024_adamw_swin.py',
+ '../../_base_/default_runtime.py'
+]
+
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='RandomResizedCrop',
+ scale=224,
+ backend='pillow',
+ interpolation='bicubic'),
+ dict(type='RandomFlip', prob=0.5, direction='horizontal'),
+ dict(
+ type='RandAugment',
+ policies='timm_increasing',
+ num_policies=2,
+ total_level=10,
+ magnitude_level=9,
+ magnitude_std=0.5,
+ hparams=dict(pad_val=[104, 116, 124], interpolation='bicubic')),
+ dict(
+ type='RandomErasing',
+ erase_prob=0.25,
+ mode='rand',
+ min_area_ratio=0.02,
+ max_area_ratio=0.3333333333333333,
+ fill_color=[103.53, 116.28, 123.675],
+ fill_std=[57.375, 57.12, 58.395]),
+ dict(type='PackInputs')
+]
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='ResizeEdge',
+ scale=256,
+ edge='short',
+ backend='pillow',
+ interpolation='bicubic'),
+ dict(type='CenterCrop', crop_size=224),
+ dict(type='PackInputs')
+]
+
+train_dataloader = dict(batch_size=128, dataset=dict(pipeline=train_pipeline))
+val_dataloader = dict(batch_size=128, dataset=dict(pipeline=test_pipeline))
+test_dataloader = val_dataloader
+
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='VisionTransformer',
+ arch='base',
+ img_size=224,
+ patch_size=16,
+ drop_path_rate=0.1,
+ out_type='avg_featmap',
+ final_norm=False,
+ init_cfg=dict(type='Pretrained', checkpoint='', prefix='backbone.')),
+ neck=None,
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=768,
+ loss=dict(
+ type='LabelSmoothLoss', label_smooth_val=0.1, mode='original'),
+ init_cfg=[dict(type='TruncNormal', layer='Linear', std=2e-5)]),
+ train_cfg=dict(augments=[
+ dict(type='Mixup', alpha=0.8),
+ dict(type='CutMix', alpha=1.0)
+ ]))
+
+# optimizer wrapper
+optim_wrapper = dict(
+ optimizer=dict(
+ type='AdamW', lr=2e-3, weight_decay=0.05, betas=(0.9, 0.999)),
+ constructor='LearningRateDecayOptimWrapperConstructor',
+ paramwise_cfg=dict(
+ layer_decay_rate=0.65,
+ custom_keys={
+ '.ln': dict(decay_mult=0.0),
+ '.bias': dict(decay_mult=0.0),
+ '.cls_token': dict(decay_mult=0.0),
+ '.pos_embed': dict(decay_mult=0.0)
+ }))
+
+# learning rate scheduler
+param_scheduler = [
+ dict(
+ type='LinearLR',
+ start_factor=1e-4,
+ by_epoch=True,
+ begin=0,
+ end=5,
+ convert_to_iter_based=True),
+ dict(
+ type='CosineAnnealingLR',
+ T_max=95,
+ by_epoch=True,
+ begin=5,
+ end=100,
+ eta_min=1e-6,
+ convert_to_iter_based=True)
+]
+
+# runtime settings
+default_hooks = dict(
+ # save checkpoint per epoch.
+ checkpoint=dict(type='CheckpointHook', interval=1, max_keep_ckpts=3))
+
+train_cfg = dict(by_epoch=True, max_epochs=100)
+
+randomness = dict(seed=0, diff_rank_seed=True)
diff --git a/configs/mff/benchmarks/vit-base-p16_8xb2048-linear-coslr-90e_in1k.py b/configs/mff/benchmarks/vit-base-p16_8xb2048-linear-coslr-90e_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..dc5f23077a20dad906fb44cf074322b394ea021d
--- /dev/null
+++ b/configs/mff/benchmarks/vit-base-p16_8xb2048-linear-coslr-90e_in1k.py
@@ -0,0 +1,74 @@
+_base_ = [
+ '../../_base_/datasets/imagenet_bs32_pil_resize.py',
+ '../../_base_/schedules/imagenet_bs1024_adamw_swin.py',
+ '../../_base_/default_runtime.py'
+]
+
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='ToPIL', to_rgb=True),
+ dict(type='MAERandomResizedCrop', size=224, interpolation=3),
+ dict(type='torchvision/RandomHorizontalFlip', p=0.5),
+ dict(type='ToNumpy', to_bgr=True),
+ dict(type='PackInputs'),
+]
+
+# dataset settings
+train_dataloader = dict(
+ batch_size=2048, drop_last=True, dataset=dict(pipeline=train_pipeline))
+val_dataloader = dict(drop_last=False)
+test_dataloader = dict(drop_last=False)
+
+# model settings
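+# Linear probing: the ViT backbone is fully frozen (frozen_stages=12); only the
+# batch-norm neck and the linear classification head are trained.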
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='VisionTransformer',
+ arch='base',
+ img_size=224,
+ patch_size=16,
+ frozen_stages=12,
+ out_type='cls_token',
+ final_norm=True,
+ init_cfg=dict(type='Pretrained', prefix='backbone.')),
+ neck=dict(type='ClsBatchNormNeck', input_features=768),
+ head=dict(
+ type='VisionTransformerClsHead',
+ num_classes=1000,
+ in_channels=768,
+ loss=dict(type='CrossEntropyLoss'),
+ init_cfg=[dict(type='TruncNormal', layer='Linear', std=0.01)]))
+
+# optimizer
+optim_wrapper = dict(
+ _delete_=True,
+ type='AmpOptimWrapper',
+ optimizer=dict(type='LARS', lr=6.4, weight_decay=0.0, momentum=0.9))
+
+# learning rate scheduler
+param_scheduler = [
+ dict(
+ type='LinearLR',
+ start_factor=1e-4,
+ by_epoch=True,
+ begin=0,
+ end=10,
+ convert_to_iter_based=True),
+ dict(
+ type='CosineAnnealingLR',
+ T_max=80,
+ by_epoch=True,
+ begin=10,
+ end=90,
+ eta_min=0.0,
+ convert_to_iter_based=True)
+]
+
+# runtime settings
+train_cfg = dict(by_epoch=True, max_epochs=90)
+
+default_hooks = dict(
+ checkpoint=dict(type='CheckpointHook', interval=1, max_keep_ckpts=1),
+ logger=dict(type='LoggerHook', interval=10))
+
+randomness = dict(seed=0, diff_rank_seed=True)
diff --git a/configs/mff/metafile.yml b/configs/mff/metafile.yml
new file mode 100644
index 0000000000000000000000000000000000000000..f1da4cc4823e7a4b80bb150987ceccd40e91bedd
--- /dev/null
+++ b/configs/mff/metafile.yml
@@ -0,0 +1,103 @@
+Collections:
+ - Name: MFF
+ Metadata:
+ Training Data: ImageNet-1k
+ Training Techniques:
+ - AdamW
+ Training Resources: 8x A100-80G GPUs
+ Architecture:
+ - ViT
+ Paper:
+ Title: Improving Pixel-based MIM by Reducing Wasted Modeling Capability
+ URL: https://arxiv.org/pdf/2308.00261.pdf
+ README: configs/mff/README.md
+
+Models:
+ - Name: mff_vit-base-p16_8xb512-amp-coslr-300e_in1k
+ Metadata:
+ Epochs: 300
+      Batch Size: 4096
+ FLOPs: 17581972224
+ Parameters: 85882692
+ Training Data: ImageNet-1k
+    In Collection: MFF
+ Results: null
+ Weights: https://download.openmmlab.com/mmpretrain/v1.0/mff/mff_vit-base-p16_8xb512-amp-coslr-300e_in1k/mff_vit-base-p16_8xb512-amp-coslr-300e_in1k_20230801-3c1bcce4.pth
+ Config: configs/mff/mff_vit-base-p16_8xb512-amp-coslr-300e_in1k.py
+ Downstream:
+ - vit-base-p16_mff-300e-pre_8xb128-coslr-100e_in1k
+ - vit-base-p16_mff-300e-pre_8xb2048-linear-coslr-90e_in1k
+ - Name: mff_vit-base-p16_8xb512-amp-coslr-800e_in1k
+ Metadata:
+ Epochs: 800
+      Batch Size: 4096
+ FLOPs: 17581972224
+ Parameters: 85882692
+ Training Data: ImageNet-1k
+    In Collection: MFF
+ Results: null
+ Weights: https://download.openmmlab.com/mmpretrain/v1.0/mff/mff_vit-base-p16_8xb512-amp-coslr-800e_in1k/mff_vit-base-p16_8xb512-amp-coslr-800e_in1k_20230801-3af7cd9d.pth
+ Config: configs/mff/mff_vit-base-p16_8xb512-amp-coslr-800e_in1k.py
+ Downstream:
+ - vit-base-p16_mff-800e-pre_8xb128-coslr-100e_in1k
+ - vit-base-p16_mff-800e-pre_8xb2048-linear-coslr-90e_in1k
+ - Name: vit-base-p16_mff-300e-pre_8xb128-coslr-100e_in1k
+ Metadata:
+ Epochs: 100
+ Batch Size: 1024
+ FLOPs: 17581215744
+ Parameters: 86566120
+ Training Data: ImageNet-1k
+    In Collection: MFF
+ Results:
+ - Task: Image Classification
+ Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 83.0
+ Weights: https://download.openmmlab.com/mmpretrain/v1.0/mff/mff_vit-base-p16_8xb512-amp-coslr-300e_in1k/vit-base-p16_8xb128-coslr-100e_in1k/vit-base-p16_8xb128-coslr-100e_in1k_20230802-d746fdb7.pth
+ Config: configs/mff/benchmarks/vit-base-p16_8xb128-coslr-100e_in1k.py
+ - Name: vit-base-p16_mff-800e-pre_8xb128-coslr-100e_in1k
+ Metadata:
+ Epochs: 100
+ Batch Size: 1024
+ FLOPs: 17581215744
+ Parameters: 86566120
+ Training Data: ImageNet-1k
+ In Collection: MFF
+ Results:
+ - Task: Image Classification
+ Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 83.7
+ Weights: https://download.openmmlab.com/mmpretrain/v1.0/mff/mff_vit-base-p16_8xb512-amp-coslr-800e_in1k/vit-base-p16_8xb128-coslr-100e/vit-base-p16_8xb128-coslr-100e_20230802-6780e47d.pth
+ Config: configs/mff/benchmarks/vit-base-p16_8xb128-coslr-100e_in1k.py
+ - Name: vit-base-p16_mff-300e-pre_8xb2048-linear-coslr-90e_in1k
+ Metadata:
+ Epochs: 90
+ Batch Size: 16384
+ FLOPs: 17581215744
+ Parameters: 86566120
+ Training Data: ImageNet-1k
+ In Collection: MFF
+ Results:
+ - Task: Image Classification
+ Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 64.2
+    Weights: null
+ Config: configs/mff/benchmarks/vit-base-p16_8xb2048-linear-coslr-90e_in1k.py
+ - Name: vit-base-p16_mff-800e-pre_8xb2048-linear-coslr-90e_in1k
+ Metadata:
+ Epochs: 90
+ Batch Size: 16384
+ FLOPs: 17581215744
+ Parameters: 86566120
+ Training Data: ImageNet-1k
+ In Collection: MFF
+ Results:
+ - Task: Image Classification
+ Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 68.3
+    Weights: https://download.openmmlab.com/mmpretrain/v1.0/mff/mff_vit-base-p16_8xb512-amp-coslr-800e_in1k/vit-base-p16_8xb2048-linear-coslr-90e/vit-base-p16_8xb2048-linear-coslr-90e_20230802-6b1f7bc8.pth
+ Config: configs/mff/benchmarks/vit-base-p16_8xb2048-linear-coslr-90e_in1k.py
diff --git a/configs/mff/mff_vit-base-p16_8xb512-amp-coslr-300e_in1k.py b/configs/mff/mff_vit-base-p16_8xb512-amp-coslr-300e_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..f9fc5219e4d8d7384bfc0e24bc98c67a71964962
--- /dev/null
+++ b/configs/mff/mff_vit-base-p16_8xb512-amp-coslr-300e_in1k.py
@@ -0,0 +1,24 @@
+_base_ = '../mae/mae_vit-base-p16_8xb512-amp-coslr-300e_in1k.py'
+
+randomness = dict(seed=2, diff_rank_seed=True)
+
+# dataset config
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='ToPIL', to_rgb=True),
+ dict(type='torchvision/Resize', size=224),
+ dict(
+ type='torchvision/RandomCrop',
+ size=224,
+ padding=4,
+ padding_mode='reflect'),
+ dict(type='torchvision/RandomHorizontalFlip', p=0.5),
+ dict(type='ToNumpy', to_bgr=True),
+ dict(type='PackInputs')
+]
+
+train_dataloader = dict(dataset=dict(pipeline=train_pipeline))
+
+# model config
+model = dict(
+ type='MFF', backbone=dict(type='MFFViT', out_indices=[0, 2, 4, 6, 8, 11]))
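+# out_indices picks intermediate encoder layers in addition to the last one;
+# their features are fused for pixel reconstruction, following the multi-level
+# feature fusion idea of MFF (shallow features assist pixel-based MIM).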
diff --git a/configs/mff/mff_vit-base-p16_8xb512-amp-coslr-800e_in1k.py b/configs/mff/mff_vit-base-p16_8xb512-amp-coslr-800e_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..d8976b22dd94d4d5d0906542c495fc23833d8e02
--- /dev/null
+++ b/configs/mff/mff_vit-base-p16_8xb512-amp-coslr-800e_in1k.py
@@ -0,0 +1,24 @@
+_base_ = '../mae/mae_vit-base-p16_8xb512-amp-coslr-800e_in1k.py'
+
+randomness = dict(seed=2, diff_rank_seed=True)
+
+# dataset config
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='ToPIL', to_rgb=True),
+ dict(type='torchvision/Resize', size=224),
+ dict(
+ type='torchvision/RandomCrop',
+ size=224,
+ padding=4,
+ padding_mode='reflect'),
+ dict(type='torchvision/RandomHorizontalFlip', p=0.5),
+ dict(type='ToNumpy', to_bgr=True),
+ dict(type='PackInputs')
+]
+
+train_dataloader = dict(dataset=dict(pipeline=train_pipeline))
+
+# model config
+model = dict(
+ type='MFF', backbone=dict(type='MFFViT', out_indices=[0, 2, 4, 6, 8, 11]))
diff --git a/configs/milan/README.md b/configs/milan/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..e1fe2289c56d27bd2fb9c6655dce769e92b155c7
--- /dev/null
+++ b/configs/milan/README.md
@@ -0,0 +1,104 @@
+# MILAN
+
+> [MILAN: Masked Image Pretraining on Language Assisted Representation](https://arxiv.org/pdf/2208.06049)
+
+
+
+## Abstract
+
+Self-attention based transformer models have been dominating many computer
+vision tasks in the past few years. Their superb model qualities heavily depend
+on the excessively large labeled image datasets. In order to reduce the reliance
+on large labeled datasets, reconstruction based masked autoencoders are gaining
+popularity, which learn high quality transferable representations from unlabeled
+images. For the same purpose, recent weakly supervised image pretraining methods
+explore language supervision from text captions accompanying the images. In this
+work, we propose masked image pretraining on language assisted representation,
+dubbed as MILAN. Instead of predicting raw pixels or low level features, our
+pretraining objective is to reconstruct the image features with substantial semantic
+signals that are obtained using caption supervision. Moreover, to accommodate our
+reconstruction target, we propose a more efficient prompting decoder architecture
+and a semantic aware mask sampling mechanism, which further advance the
+transfer performance of the pretrained model. Experimental results demonstrate
+that MILAN delivers higher accuracy than the previous works. When the masked
+autoencoder is pretrained and finetuned on ImageNet-1K dataset with an input
+resolution of 224×224, MILAN achieves a top-1 accuracy of 85.4% on ViT-B/16, surpassing the previous state of the art by 1%. In the downstream semantic
+segmentation task, MILAN achieves 52.7 mIoU using ViT-B/16 backbone on
+ADE20K dataset, outperforming previous masked pretraining results by 4 points.
+
+
+
+
+
+## How to use it?
+
+
+
+**Predict image**
+
+```python
+from mmpretrain import inference_model
+
+predict = inference_model('vit-base-p16_milan-pre_8xb128-coslr-100e_in1k', 'demo/bird.JPEG')
+print(predict['pred_class'])
+print(predict['pred_score'])
+```
+
+**Use the model**
+
+```python
+import torch
+from mmpretrain import get_model
+
+model = get_model('milan_vit-base-p16_16xb256-amp-coslr-400e_in1k', pretrained=True)
+inputs = torch.rand(1, 3, 224, 224)
+out = model(inputs)
+print(type(out))
+# To extract features.
+feats = model.extract_feat(inputs)
+print(type(feats))
+```
+
+**Train/Test Command**
+
+Prepare your dataset according to the [docs](https://mmpretrain.readthedocs.io/en/latest/user_guides/dataset_prepare.html#prepare-dataset).
+
+Train:
+
+```shell
+python tools/train.py configs/milan/milan_vit-base-p16_16xb256-amp-coslr-400e_in1k.py
+```
+
+Test:
+
+```shell
+python tools/test.py configs/milan/benchmarks/vit-base-p16_8xb128-coslr-100e_in1k.py https://download.openmmlab.com/mmselfsup/1.x/milan/milan_vit-base-p16_16xb256-amp-coslr-400e_in1k/vit-base-p16_ft-8xb128-coslr-100e_in1k/vit-base-p16_ft-8xb128-coslr-100e_in1k-milan_20221129-74ac94fa.pth
+```
+
+
+
+## Models and results
+
+### Pretrained models
+
+| Model | Params (M) | Flops (G) | Config | Download |
+| :----------------------------------------------- | :--------: | :-------: | :---------------------------------------------------------: | :------------------------------------------------------------------------: |
+| `milan_vit-base-p16_16xb256-amp-coslr-400e_in1k` | 111.91 | 17.58 | [config](milan_vit-base-p16_16xb256-amp-coslr-400e_in1k.py) | [model](https://download.openmmlab.com/mmselfsup/1.x/milan/milan_vit-base-p16_16xb256-amp-coslr-400e_in1k/milan_vit-base-p16_16xb256-amp-coslr-400e_in1k_20221129-180922e8.pth) \| [log](https://download.openmmlab.com/mmselfsup/1.x/milan/milan_vit-base-p16_16xb256-amp-coslr-400e_in1k/milan_vit-base-p16_16xb256-amp-coslr-400e_in1k_20221129-180922e8.json) |
+
+### Image Classification on ImageNet-1k
+
+| Model | Pretrain | Params (M) | Flops (G) | Top-1 (%) | Config | Download |
+| :---------------------------------------- | :------------------------------------------: | :--------: | :-------: | :-------: | :----------------------------------------: | :-------------------------------------------: |
+| `vit-base-p16_milan-pre_8xb128-coslr-100e_in1k` | [MILAN](https://download.openmmlab.com/mmselfsup/1.x/milan/milan_vit-base-p16_16xb256-amp-coslr-400e_in1k/milan_vit-base-p16_16xb256-amp-coslr-400e_in1k_20221129-180922e8.pth) | 86.57 | 17.58 | 85.30 | [config](benchmarks/vit-base-p16_8xb128-coslr-100e_in1k.py) | [model](https://download.openmmlab.com/mmselfsup/1.x/milan/milan_vit-base-p16_16xb256-amp-coslr-400e_in1k/vit-base-p16_ft-8xb128-coslr-100e_in1k/vit-base-p16_ft-8xb128-coslr-100e_in1k-milan_20221129-74ac94fa.pth) \| [log](https://download.openmmlab.com/mmselfsup/1.x/milan/milan_vit-base-p16_16xb256-amp-coslr-400e_in1k/vit-base-p16_ft-8xb128-coslr-100e_in1k/vit-base-p16_ft-8xb128-coslr-100e_in1k-milan_20221129-74ac94fa.json) |
+| `vit-base-p16_milan-pre_8xb2048-linear-coslr-100e_in1k` | [MILAN](https://download.openmmlab.com/mmselfsup/1.x/milan/milan_vit-base-p16_16xb256-amp-coslr-400e_in1k/milan_vit-base-p16_16xb256-amp-coslr-400e_in1k_20221129-180922e8.pth) | 86.57 | 17.58 | 78.90 | [config](benchmarks/vit-base-p16_8xb2048-linear-coslr-100e_in1k.py) | [model](https://download.openmmlab.com/mmselfsup/1.x/milan/milan_vit-base-p16_16xb256-amp-coslr-400e_in1k/vit-base-p16_linear-8xb2048-coslr-100e_in1k/vit-base-p16_linear-8xb2048-coslr-100e_in1k_20221129-03f26f85.pth) \| [log](https://download.openmmlab.com/mmselfsup/1.x/milan/milan_vit-base-p16_16xb256-amp-coslr-400e_in1k/vit-base-p16_linear-8xb2048-coslr-100e_in1k/vit-base-p16_linear-8xb2048-coslr-100e_in1k_20221129-03f26f85.json) |
+
+## Citation
+
+```bibtex
+@article{Hou2022MILANMI,
+ title={MILAN: Masked Image Pretraining on Language Assisted Representation},
+ author={Zejiang Hou and Fei Sun and Yen-Kuang Chen and Yuan Xie and S. Y. Kung},
+ journal={ArXiv},
+ year={2022}
+}
+```
diff --git a/configs/milan/benchmarks/vit-base-p16_8xb128-coslr-100e_in1k.py b/configs/milan/benchmarks/vit-base-p16_8xb128-coslr-100e_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..e8a3f4983ac19208090ee63e9c9160b945b22ee6
--- /dev/null
+++ b/configs/milan/benchmarks/vit-base-p16_8xb128-coslr-100e_in1k.py
@@ -0,0 +1,114 @@
+_base_ = [
+ '../../_base_/datasets/imagenet_bs64_swin_224.py',
+ '../../_base_/schedules/imagenet_bs1024_adamw_swin.py',
+ '../../_base_/default_runtime.py'
+]
+
+# dataset settings
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='RandomResizedCrop',
+ scale=224,
+ backend='pillow',
+ interpolation='bicubic'),
+ dict(type='RandomFlip', prob=0.5, direction='horizontal'),
+ dict(
+ type='RandAugment',
+ policies='timm_increasing',
+ num_policies=2,
+ total_level=10,
+ magnitude_level=9,
+ magnitude_std=0.5,
+ hparams=dict(pad_val=[104, 116, 124], interpolation='bicubic')),
+ dict(
+ type='RandomErasing',
+ erase_prob=0.25,
+ mode='rand',
+ min_area_ratio=0.02,
+ max_area_ratio=0.3333333333333333,
+ fill_color=[103.53, 116.28, 123.675],
+ fill_std=[57.375, 57.12, 58.395]),
+ dict(type='PackInputs')
+]
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='ResizeEdge',
+ scale=256,
+ edge='short',
+ backend='pillow',
+ interpolation='bicubic'),
+ dict(type='CenterCrop', crop_size=224),
+ dict(type='PackInputs')
+]
+
+train_dataloader = dict(batch_size=128, dataset=dict(pipeline=train_pipeline))
+val_dataloader = dict(batch_size=128, dataset=dict(pipeline=test_pipeline))
+test_dataloader = val_dataloader
+
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='VisionTransformer',
+ arch='base',
+ img_size=224,
+ patch_size=16,
+ drop_path_rate=0.1,
+ out_type='avg_featmap',
+ final_norm=False,
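+ # NOTE: `checkpoint` is left empty below; set it to the pretrained MILAN
+ # weights linked in the README (or override it via --cfg-options) before
+ # fine-tuning.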
+ init_cfg=dict(type='Pretrained', checkpoint='', prefix='backbone.')),
+ neck=None,
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=768,
+ loss=dict(
+ type='LabelSmoothLoss', label_smooth_val=0.1, mode='original'),
+ init_cfg=[dict(type='TruncNormal', layer='Linear', std=0.02)]),
+ train_cfg=dict(augments=[
+ dict(type='Mixup', alpha=0.8),
+ dict(type='CutMix', alpha=1.0)
+ ]))
+
+# optimizer wrapper
+optim_wrapper = dict(
+ optimizer=dict(
+ type='AdamW', lr=4e-4, weight_decay=0.05, betas=(0.9, 0.999)),
+ constructor='LearningRateDecayOptimWrapperConstructor',
+ paramwise_cfg=dict(
+ layer_decay_rate=0.65,
+ custom_keys={
+ '.ln': dict(decay_mult=0.0),
+ '.bias': dict(decay_mult=0.0),
+ '.cls_token': dict(decay_mult=0.0),
+ '.pos_embed': dict(decay_mult=0.0)
+ }))
+
+# learning rate scheduler
+param_scheduler = [
+ dict(
+ type='LinearLR',
+ start_factor=1e-4,
+ by_epoch=True,
+ begin=0,
+ end=5,
+ convert_to_iter_based=True),
+ dict(
+ type='CosineAnnealingLR',
+ T_max=95,
+ by_epoch=True,
+ begin=5,
+ end=100,
+ eta_min=1e-6,
+ convert_to_iter_based=True)
+]
+
+# runtime settings
+train_cfg = dict(by_epoch=True, max_epochs=100)
+default_hooks = dict(
+ # save checkpoint per epoch.
+ checkpoint=dict(type='CheckpointHook', interval=1, max_keep_ckpts=3))
+
+randomness = dict(seed=0, diff_rank_seed=True)
diff --git a/configs/milan/benchmarks/vit-base-p16_8xb2048-linear-coslr-100e_in1k.py b/configs/milan/benchmarks/vit-base-p16_8xb2048-linear-coslr-100e_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..0b7333ca475ad1d9607ddda898acb623e1bd7aa4
--- /dev/null
+++ b/configs/milan/benchmarks/vit-base-p16_8xb2048-linear-coslr-100e_in1k.py
@@ -0,0 +1,70 @@
+_base_ = [
+ '../../_base_/datasets/imagenet_bs32_pil_resize.py',
+ '../../_base_/schedules/imagenet_bs1024_adamw_swin.py',
+ '../../_base_/default_runtime.py'
+]
+
+train_dataloader = dict(batch_size=2048, drop_last=True)
+val_dataloader = dict(drop_last=False)
+test_dataloader = dict(drop_last=False)
+
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='VisionTransformer',
+ arch='base',
+ img_size=224,
+ patch_size=16,
+ frozen_stages=12,
+ out_type='cls_token',
+ final_norm=True,
+ init_cfg=dict(type='Pretrained', checkpoint='', prefix='backbone.')),
+ neck=dict(type='ClsBatchNormNeck', input_features=768),
+ head=dict(
+ type='VisionTransformerClsHead',
+ num_classes=1000,
+ in_channels=768,
+ loss=dict(type='CrossEntropyLoss'),
+ init_cfg=[dict(type='TruncNormal', layer='Linear', std=0.01)]),
+ data_preprocessor=dict(
+ num_classes=1000,
+ mean=[123.675, 116.28, 103.53],
+ std=[58.395, 57.12, 57.375],
+ to_rgb=True,
+ ))
+
+# optimizer
+optim_wrapper = dict(
+ _delete_=True,
+ type='AmpOptimWrapper',
+ optimizer=dict(type='LARS', lr=3.2, weight_decay=0.0, momentum=0.9),
+)
+
+# learning rate scheduler
+param_scheduler = [
+ dict(
+ type='LinearLR',
+ start_factor=1e-4,
+ by_epoch=True,
+ begin=0,
+ end=10,
+ convert_to_iter_based=True),
+ dict(
+ type='CosineAnnealingLR',
+ T_max=90,
+ by_epoch=True,
+ begin=10,
+ end=100,
+ eta_min=0.0,
+ convert_to_iter_based=True)
+]
+
+# runtime settings
+train_cfg = dict(by_epoch=True, max_epochs=100)
+
+default_hooks = dict(
+ checkpoint=dict(type='CheckpointHook', interval=1, max_keep_ckpts=3),
+ logger=dict(type='LoggerHook', interval=10))
+
+randomness = dict(seed=0, diff_rank_seed=True)
diff --git a/configs/milan/metafile.yml b/configs/milan/metafile.yml
new file mode 100644
index 0000000000000000000000000000000000000000..a790815fa28d063f909dfc1855b2a33f67f59893
--- /dev/null
+++ b/configs/milan/metafile.yml
@@ -0,0 +1,59 @@
+Collections:
+ - Name: MILAN
+ Metadata:
+ Training Data: ImageNet-1k
+ Training Techniques:
+ - AdamW
+ Training Resources: 16x A100-80G GPUs
+ Architecture:
+ - ViT
+ Paper:
+ Title: 'MILAN: Masked Image Pretraining on Language Assisted Representation'
+ URL: https://arxiv.org/pdf/2208.06049
+ README: configs/milan/README.md
+
+Models:
+ - Name: milan_vit-base-p16_16xb256-amp-coslr-400e_in1k
+ Metadata:
+ Epochs: 400
+ Batch Size: 4096
+ FLOPs: 17581972224
+ Parameters: 111907584
+ Training Data: ImageNet-1k
+ In Collection: MILAN
+ Results: null
+ Weights: https://download.openmmlab.com/mmselfsup/1.x/milan/milan_vit-base-p16_16xb256-amp-coslr-400e_in1k/milan_vit-base-p16_16xb256-amp-coslr-400e_in1k_20221129-180922e8.pth
+ Config: configs/milan/milan_vit-base-p16_16xb256-amp-coslr-400e_in1k.py
+ Downstream:
+ - vit-base-p16_milan-pre_8xb128-coslr-100e_in1k
+ - vit-base-p16_milan-pre_8xb2048-linear-coslr-100e_in1k
+ - Name: vit-base-p16_milan-pre_8xb128-coslr-100e_in1k
+ Metadata:
+ Epochs: 100
+ Batch Size: 1024
+ FLOPs: 17581215744
+ Parameters: 86566120
+ Training Data: ImageNet-1k
+ In Collection: MILAN
+ Results:
+ - Task: Image Classification
+ Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 85.3
+ Weights: https://download.openmmlab.com/mmselfsup/1.x/milan/milan_vit-base-p16_16xb256-amp-coslr-400e_in1k/vit-base-p16_ft-8xb128-coslr-100e_in1k/vit-base-p16_ft-8xb128-coslr-100e_in1k-milan_20221129-74ac94fa.pth
+ Config: configs/milan/benchmarks/vit-base-p16_8xb128-coslr-100e_in1k.py
+ - Name: vit-base-p16_milan-pre_8xb2048-linear-coslr-100e_in1k
+ Metadata:
+ Epochs: 100
+ Batch Size: 16384
+ FLOPs: 17581972992
+ Parameters: 86567656
+ Training Data: ImageNet-1k
+ In Collection: MILAN
+ Results:
+ - Task: Image Classification
+ Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 78.9
+ Weights: https://download.openmmlab.com/mmselfsup/1.x/milan/milan_vit-base-p16_16xb256-amp-coslr-400e_in1k/vit-base-p16_linear-8xb2048-coslr-100e_in1k/vit-base-p16_linear-8xb2048-coslr-100e_in1k_20221129-03f26f85.pth
+ Config: configs/milan/benchmarks/vit-base-p16_8xb2048-linear-coslr-100e_in1k.py
diff --git a/configs/milan/milan_vit-base-p16_16xb256-amp-coslr-400e_in1k.py b/configs/milan/milan_vit-base-p16_16xb256-amp-coslr-400e_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..ac80ab7b1bff159eed3eacc432a1b7b48e4cb221
--- /dev/null
+++ b/configs/milan/milan_vit-base-p16_16xb256-amp-coslr-400e_in1k.py
@@ -0,0 +1,88 @@
+_base_ = [
+ '../_base_/datasets/imagenet_bs512_mae.py',
+ '../_base_/default_runtime.py',
+]
+
+# dataset settings
+train_dataloader = dict(batch_size=256)
+
+# model settings
+model = dict(
+ type='MILAN',
+ backbone=dict(
+ type='MILANViT',
+ arch='b',
+ patch_size=16,
+ mask_ratio=0.75,
+ init_cfg=[
+ dict(type='Xavier', distribution='uniform', layer='Linear'),
+ dict(type='Constant', layer='LayerNorm', val=1.0, bias=0.0)
+ ]),
+ neck=dict(
+ type='MILANPretrainDecoder',
+ init_cfg=[
+ dict(type='Xavier', distribution='uniform', layer='Linear'),
+ dict(type='Constant', layer='LayerNorm', val=1.0, bias=0.0)
+ ]),
+ head=dict(
+ type='MIMHead',
+ loss=dict(
+ type='CosineSimilarityLoss', shift_factor=2.0, scale_factor=2.0),
+ ),
+ target_generator=dict(
+ type='CLIPGenerator',
+ tokenizer_path= # noqa
+ 'https://download.openmmlab.com/mmselfsup/1.x/target_generator_ckpt/clip_vit_base_16.pth.tar' # noqa
+ ),
+ init_cfg=None)
+
+# optimizer wrapper
+optim_wrapper = dict(
+ type='OptimWrapper',
+ optimizer=dict(
+ type='AdamW',
+ lr=1.5e-4 * 4096 / 256,
+ betas=(0.9, 0.95),
+ weight_decay=0.05),
+ paramwise_cfg=dict(
+ custom_keys={
+ 'ln': dict(decay_mult=0.0),
+ 'bias': dict(decay_mult=0.0),
+ 'pos_embed': dict(decay_mult=0.),
+ 'mask_token': dict(decay_mult=0.),
+ 'cls_token': dict(decay_mult=0.)
+ }))
+find_unused_parameters = True
+
+# learning rate scheduler
+param_scheduler = [
+ dict(
+ type='LinearLR',
+ start_factor=1e-4,
+ by_epoch=True,
+ begin=0,
+ end=40,
+ convert_to_iter_based=True),
+ dict(
+ type='CosineAnnealingLR',
+ T_max=360,
+ by_epoch=True,
+ begin=40,
+ end=400,
+ convert_to_iter_based=True)
+]
+
+# runtime settings
+train_cfg = dict(type='EpochBasedTrainLoop', max_epochs=400)
+default_hooks = dict(
+ # only keeps the latest 3 checkpoints
+ checkpoint=dict(type='CheckpointHook', interval=1, max_keep_ckpts=3))
+
+randomness = dict(seed=0, diff_rank_seed=True)
+
+# auto resume
+resume = True
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR
+# based on the actual training batch size.
+auto_scale_lr = dict(base_batch_size=2048)
diff --git a/configs/minigpt4/README.md b/configs/minigpt4/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..23666fc9f951262bf9aee65dda933c0000b891f8
--- /dev/null
+++ b/configs/minigpt4/README.md
@@ -0,0 +1,53 @@
+# MiniGPT4
+
+> [MiniGPT-4: Enhancing Vision-language Understanding with Advanced Large Language Models](https://arxiv.org/abs/2304.10592)
+
+
+
+## Abstract
+
+The recent GPT-4 has demonstrated extraordinary multi-modal abilities, such as directly generating websites from handwritten text and identifying humorous elements within images. These features are rarely observed in previous vision-language models. We believe the primary reason for GPT-4's advanced multi-modal generation capabilities lies in the utilization of a more advanced large language model (LLM). To examine this phenomenon, we present MiniGPT-4, which aligns a frozen visual encoder with a frozen LLM, Vicuna, using just one projection layer. Our findings reveal that MiniGPT-4 possesses many capabilities similar to those exhibited by GPT-4 like detailed image description generation and website creation from hand-written drafts. Furthermore, we also observe other emerging capabilities in MiniGPT-4, including writing stories and poems inspired by given images, providing solutions to problems shown in images, teaching users how to cook based on food photos, etc. In our experiment, we found that only performing the pretraining on raw image-text pairs could produce unnatural language outputs that lack coherency including repetition and fragmented sentences. To address this problem, we curate a high-quality, well-aligned dataset in the second stage to finetune our model using a conversational template. This step proved crucial for augmenting the model's generation reliability and overall usability. Notably, our model is highly computationally efficient, as we only train a projection layer utilizing approximately 5 million aligned image-text pairs. Our code, pre-trained model, and collected dataset are available at https://minigpt-4.github.io/.
+
+
+

+
+
+## How to use it?
+
+
+
+**Use the model**
+
+```python
+from mmpretrain import inference_model
+
+result = inference_model('minigpt-4_vicuna-7b_caption', 'demo/cat-dog.png')
+print(result)
+# {'pred_caption': 'This image shows a small dog and a kitten sitting on a blanket in a field of flowers. The dog is looking up at the kitten with a playful expression on its face. The background is a colorful striped blanket, and there are flowers all around them. The image is well composed with the two animals sitting in the center of the frame, surrounded by the flowers and blanket.'}
+```
+
+
+
+## Models and results
+
+For the Vicuna model, please refer to the [MiniGPT-4 page](https://github.com/Vision-CAIR/MiniGPT-4) for preparation guidelines.
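+
+The sketch below shows one way to point this config at locally prepared weights (a minimal sketch, assuming the `mmengine` config API and a hypothetical local directory holding the Vicuna-7B weights; adjust the path to your setup):
+
+```python
+from mmengine.config import Config
+
+# Hypothetical path to the Vicuna-7B weights prepared per the MiniGPT-4 page.
+vicuna_dir = '/path/to/vicuna-7b'
+
+cfg = Config.fromfile('configs/minigpt4/minigpt-4_vicuna-7b_caption.py')
+cfg.model.lang_encoder.name_or_path = vicuna_dir
+cfg.model.tokenizer.name_or_path = vicuna_dir
+
+# Save a local copy that tools/test.py or the inference APIs can consume.
+cfg.dump('configs/minigpt4/minigpt-4_vicuna-7b_caption_local.py')
+```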
+
+### Pretrained models
+
+| Model | Params (M) | Flops (G) | Config | Download |
+| :------------------------------ | :--------: | :-------: | :----------------------------------------: | :----------------------------------------------------------------------------------------------------------: |
+| `minigpt-4_baichuan-7b_caption` | 8094.77 | N/A | [config](minigpt-4_baichuan-7b_caption.py) | [model](https://download.openmmlab.com/mmclassification/v1/minigpt4/minigpt-4_linear_baichuan7b_20231011-5dca7ed6.pth) |
+| `minigpt-4_vicuna-7b_caption`\* | 8121.32 | N/A | [config](minigpt-4_vicuna-7b_caption.py) | [model](https://download.openmmlab.com/mmclassification/v1/minigpt4/minigpt-4_linear_vicuna7b_20230615-714b5f52.pth) |
+
+*Models with * are converted from the [official repo](https://github.com/Vision-CAIR/MiniGPT-4/tree/main). The config files of these models are only for inference. We haven't reproduced the training results.*
+
+## Citation
+
+```bibtex
+@article{zhu2023minigpt,
+ title={MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models},
+ author={Zhu, Deyao and Chen, Jun and Shen, Xiaoqian and Li, Xiang and Elhoseiny, Mohamed},
+ journal={arXiv preprint arXiv:2304.10592},
+ year={2023}
+}
+```
diff --git a/configs/minigpt4/metafile.yml b/configs/minigpt4/metafile.yml
new file mode 100644
index 0000000000000000000000000000000000000000..f70cc9ba6045f237414f8dc3ee8572187528a667
--- /dev/null
+++ b/configs/minigpt4/metafile.yml
@@ -0,0 +1,37 @@
+Collections:
+ - Name: MiniGPT4
+ Metadata:
+ Architecture:
+ - Transformer
+ - Gated Cross-Attention Dense
+ Paper:
+ Title: 'MiniGPT-4: Enhancing Vision-language Understanding with Advanced Large Language Models'
+ URL: https://arxiv.org/abs/2304.10592
+ README: configs/minigpt4/README.md
+
+Models:
+ - Name: minigpt-4_vicuna-7b_caption
+ Metadata:
+ FLOPs: null
+ Parameters: 8121315072
+ In Collection: MiniGPT4
+ Results:
+ - Task: Image Caption
+ Dataset: COCO
+ Metrics: null
+ Weights: https://download.openmmlab.com/mmclassification/v1/minigpt4/minigpt-4_linear_vicuna7b_20230615-714b5f52.pth
+ Config: configs/minigpt4/minigpt-4_vicuna-7b_caption.py
+ Converted From:
+ Weights: https://github.com/Vision-CAIR/MiniGPT-4/tree/main
+ Code: https://github.com/Vision-CAIR/MiniGPT-4/tree/main
+ - Name: minigpt-4_baichuan-7b_caption
+ Metadata:
+ FLOPs: null
+ Parameters: 8094769024
+ In Collection: MiniGPT4
+ Results:
+ - Task: Image Caption
+ Dataset: COCO
+ Metrics: null
+ Weights: https://download.openmmlab.com/mmclassification/v1/minigpt4/minigpt-4_linear_baichuan7b_20231011-5dca7ed6.pth
+ Config: configs/minigpt4/minigpt-4_baichuan-7b_caption.py
diff --git a/configs/minigpt4/minigpt-4_baichuan-7b_caption.py b/configs/minigpt4/minigpt-4_baichuan-7b_caption.py
new file mode 100644
index 0000000000000000000000000000000000000000..7e610a099c8dfcea86dff87c69487f6879926f21
--- /dev/null
+++ b/configs/minigpt4/minigpt-4_baichuan-7b_caption.py
@@ -0,0 +1,190 @@
+_base_ = [
+ '../_base_/default_runtime.py',
+]
+
+data_preprocessor = dict(
+ type='MultiModalDataPreprocessor',
+ mean=[122.770938, 116.7460125, 104.09373615],
+ std=[68.5005327, 66.6321579, 70.32316305],
+ to_rgb=True,
+)
+
+# dataset settings
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='Resize',
+ scale=(224, 224),
+ interpolation='bicubic',
+ backend='pillow'),
+ dict(type='RandomFlip', prob=0.5, direction='horizontal'),
+ dict(
+ type='CleanCaption',
+ keys='chat_content',
+ remove_chars='',
+ lowercase=False),
+ dict(
+ type='PackInputs',
+ algorithm_keys=['chat_content', 'lang'],
+ meta_keys=['image_id']),
+]
+
+train_dataloader = dict(
+ batch_size=2,
+ num_workers=4,
+ dataset=dict(
+ type='MiniGPT4Dataset',
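+ # Placeholders: replace `data_root` and `ann_file` below with your own
+ # dataset directory and annotation file before launching training.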
+ data_root='YOUR_DATA_DIRECTORY',
+ ann_file='YOUR_DATA_FILE',
+ pipeline=train_pipeline),
+ sampler=dict(type='DefaultSampler', shuffle=True),
+ collate_fn=dict(type='default_collate'),
+ drop_last=False,
+)
+
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='Resize',
+ scale=(224, 224),
+ interpolation='bicubic',
+ backend='pillow'),
+ dict(type='PackInputs', meta_keys=['image_id']),
+]
+
+test_evaluator = dict(
+ type='COCOCaption',
+ ann_file='data/coco/annotations/coco_karpathy_val_gt.json',
+)
+
+test_dataloader = dict(
+ batch_size=1,
+ dataset=dict(
+ type='COCOCaption',
+ data_root='data/coco',
+ ann_file='annotations/coco_karpathy_val.json',
+ pipeline=test_pipeline))
+
+# model settings
+model = dict(
+ type='MiniGPT4',
+ vision_encoder=dict(
+ type='BEiTViT',
+ # eva-g without the final layer
+ arch=dict(
+ embed_dims=1408,
+ num_layers=39,
+ num_heads=16,
+ feedforward_channels=6144,
+ ),
+ img_size=224,
+ patch_size=14,
+ layer_scale_init_value=0.0,
+ frozen_stages=39,
+ use_abs_pos_emb=True,
+ use_rel_pos_bias=False,
+ final_norm=False,
+ use_shared_rel_pos_bias=False,
+ out_type='raw',
+ pretrained= # noqa
+ 'https://download.openmmlab.com/mmpretrain/v1.0/minigpt4/minigpt-4_eva-g-p14_20230615-e908c021.pth' # noqa
+ ),
+ q_former_model=dict(
+ type='Qformer',
+ model_style='bert-base-uncased',
+ vision_model_width=1408,
+ add_cross_attention=True,
+ cross_attention_freq=2,
+ num_query_token=32,
+ pretrained= # noqa
+ 'https://download.openmmlab.com/mmpretrain/v1.0/minigpt4/minigpt-4_qformer_20230615-1dfa889c.pth' # noqa
+ ),
+ lang_encoder=dict(
+ type='AutoModelForCausalLM',
+ name_or_path='baichuan-inc/baichuan-7B',
+ trust_remote_code=True),
+ tokenizer=dict(
+ type='AutoTokenizer',
+ name_or_path='baichuan-inc/baichuan-7B',
+ trust_remote_code=True),
+ task='caption',
+ prompt_template=dict([('en', '###Ask: {} ###Answer: '),
+ ('zh', '###问:{} ###答:')]),
+ # '<ImageHere>' marks where the image embedding is inserted into the prompt.
+ raw_prompts=dict([
+ ('en', [('<Img><ImageHere></Img> '
+ 'Describe this image in detail.'),
+ ('<Img><ImageHere></Img> '
+ 'Take a look at this image and describe what you notice.'),
+ ('<Img><ImageHere></Img> '
+ 'Please provide a detailed description of the picture.'),
+ ('<Img><ImageHere></Img> '
+ 'Could you describe the contents of this image for me?')]),
+ ('zh', [('<Img><ImageHere></Img> '
+ '详细描述这张图片。'),
+ ('<Img><ImageHere></Img> '
+ '浏览这张图片并描述你注意到什么。'),
+ ('<Img><ImageHere></Img> '
+ '请对这张图片进行详细的描述。'),
+ ('<Img><ImageHere></Img> '
+ '你能为我描述这张图片的内容吗?')])
+ ]),
+ max_txt_len=160,
+ end_sym='###')
+
+strategy = dict(
+ type='DeepSpeedStrategy',
+ fp16=dict(
+ enabled=True,
+ auto_cast=False,
+ fp16_master_weights_and_grads=False,
+ loss_scale=0,
+ loss_scale_window=1000,
+ hysteresis=1,
+ min_loss_scale=1,
+ initial_scale_power=16,
+ ),
+ inputs_to_half=[0],
+ zero_optimization=dict(
+ stage=2,
+ allgather_partitions=True,
+ allgather_bucket_size=2e8,
+ reduce_scatter=True,
+ reduce_bucket_size='auto',
+ overlap_comm=True,
+ contiguous_gradients=True,
+ ),
+)
+
+# schedule settings
+optim_wrapper = dict(
+ type='DeepSpeedOptimWrapper',
+ optimizer=dict(type='AdamW', lr=1e-3, weight_decay=0.05))
+
+param_scheduler = [
+ dict(
+ type='LinearLR',
+ start_factor=1e-3 / 500,
+ by_epoch=False,
+ begin=0,
+ end=500,
+ ),
+ dict(
+ type='CosineAnnealingLR',
+ eta_min=2e-4,
+ by_epoch=False,
+ begin=500,
+ ),
+]
+
+train_cfg = dict(by_epoch=True, max_epochs=6)
+test_cfg = dict()
+
+runner_type = 'FlexibleRunner'
+
+default_hooks = dict(
+ checkpoint=dict(
+ type='CheckpointHook',
+ interval=1,
+ by_epoch=True,
+ save_last=True,
+ max_keep_ckpts=1,
+ ))
diff --git a/configs/minigpt4/minigpt-4_vicuna-7b_caption.py b/configs/minigpt4/minigpt-4_vicuna-7b_caption.py
new file mode 100644
index 0000000000000000000000000000000000000000..f468e2d8fac7ce46871801c9cc490acb97db683d
--- /dev/null
+++ b/configs/minigpt4/minigpt-4_vicuna-7b_caption.py
@@ -0,0 +1,94 @@
+_base_ = [
+ '../_base_/datasets/coco_caption.py',
+ '../_base_/default_runtime.py',
+]
+
+# dataset settings
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='Resize',
+ scale=(224, 224),
+ interpolation='bicubic',
+ backend='pillow'),
+ dict(type='PackInputs', meta_keys=['image_id']),
+]
+
+val_dataloader = dict(batch_size=1, dataset=dict(pipeline=test_pipeline))
+test_dataloader = val_dataloader
+
+# model settings
+model = dict(
+ type='MiniGPT4',
+ vision_encoder=dict(
+ type='BEiTViT',
+ # eva-g without the final layer
+ arch=dict(
+ embed_dims=1408,
+ num_layers=39,
+ num_heads=16,
+ feedforward_channels=6144,
+ ),
+ img_size=224,
+ patch_size=14,
+ layer_scale_init_value=0.0,
+ frozen_stages=39,
+ use_abs_pos_emb=True,
+ use_rel_pos_bias=False,
+ final_norm=False,
+ use_shared_rel_pos_bias=False,
+ out_type='raw',
+ pretrained= # noqa
+ 'https://download.openmmlab.com/mmpretrain/v1.0/minigpt4/minigpt-4_eva-g-p14_20230615-e908c021.pth' # noqa
+ ),
+ q_former_model=dict(
+ type='Qformer',
+ model_style='bert-base-uncased',
+ vision_model_width=1408,
+ add_cross_attention=True,
+ cross_attention_freq=2,
+ num_query_token=32,
+ pretrained= # noqa
+ 'https://download.openmmlab.com/mmpretrain/v1.0/minigpt4/minigpt-4_qformer_20230615-1dfa889c.pth' # noqa
+ ),
+ lang_encoder=dict(
+ type='AutoModelForCausalLM', name_or_path='YOUR_PATH_TO_VICUNA'),
+ tokenizer=dict(type='LlamaTokenizer', name_or_path='YOUR_PATH_TO_VICUNA'),
+ task='caption',
+ prompt_template=dict([('en', '###Ask: {} ###Answer: '),
+ ('zh', '###问:{} ###答:')]),
+ raw_prompts=dict([
+ ('en', [('<Img><ImageHere></Img> '
+ 'Describe this image in detail.'),
+ ('<Img><ImageHere></Img> '
+ 'Take a look at this image and describe what you notice.'),
+ ('<Img><ImageHere></Img> '
+ 'Please provide a detailed description of the picture.'),
+ ('<Img><ImageHere></Img> '
+ 'Could you describe the contents of this image for me?')]),
+ ('zh', [('<Img><ImageHere></Img> '
+ '详细描述这张图片。'),
+ ('<Img><ImageHere></Img> '
+ '浏览这张图片并描述你注意到什么。'),
+ ('<Img><ImageHere></Img> '
+ '请对这张图片进行详细的描述。'),
+ ('<Img><ImageHere></Img> '
+ '你能为我描述这张图片的内容吗?')])
+ ]),
+ max_txt_len=160,
+ end_sym='###')
+
+# schedule settings
+optim_wrapper = dict(optimizer=dict(type='AdamW', lr=1e-5, weight_decay=0.05))
+
+param_scheduler = [
+ dict(
+ type='CosineAnnealingLR',
+ by_epoch=True,
+ begin=0,
+ end=5,
+ )
+]
+
+train_cfg = dict(by_epoch=True, max_epochs=5)
+val_cfg = dict()
+test_cfg = dict()
diff --git a/configs/mixmim/README.md b/configs/mixmim/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..e07f5011b32463a7be65d2cbe285148e88a6b3fc
--- /dev/null
+++ b/configs/mixmim/README.md
@@ -0,0 +1,102 @@
+# MixMIM
+
+> [MixMIM: Mixed and Masked Image Modeling for Efficient Visual Representation Learning](https://arxiv.org/abs/2205.13137)
+
+
+
+## Abstract
+
+In this study, we propose Mixed and Masked Image Modeling (MixMIM), a
+simple but efficient MIM method that is applicable to various hierarchical Vision
+Transformers. Existing MIM methods replace a random subset of input tokens with
+a special [MASK] symbol and aim at reconstructing original image tokens from
+the corrupted image. However, we find that using the [MASK] symbol greatly
+slows down the training and causes training-finetuning inconsistency, due to the
+large masking ratio (e.g., 40% in BEiT). In contrast, we replace the masked tokens
+of one image with visible tokens of another image, i.e., creating a mixed image.
+We then conduct dual reconstruction to reconstruct the original two images from
+the mixed input, which significantly improves efficiency. While MixMIM can
+be applied to various architectures, this paper explores a simpler but stronger
+hierarchical Transformer, and scales with MixMIM-B, -L, and -H. Empirical
+results demonstrate that MixMIM can learn high-quality visual representations
+efficiently. Notably, MixMIM-B with 88M parameters achieves 85.1% top-1
+accuracy on ImageNet-1K by pretraining for 600 epochs, setting a new record for
+neural networks with comparable model sizes (e.g., ViT-B) among MIM methods.
+Besides, its transferring performances on the other 6 datasets show MixMIM has
+better FLOPs / performance tradeoff than previous MIM methods.
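+
+The core mixing step described above can be illustrated with a short, self-contained sketch (an illustration of the idea only, not the MixMIM backbone implementation added in this diff; the helper name, shapes and masking ratio are assumptions):
+
+```python
+import torch
+
+
+def mix_image_tokens(tokens_a, tokens_b, mask):
+    """Fill the masked positions of image A with the visible tokens of image B.
+
+    tokens_a, tokens_b: (B, L, C) patch tokens of two different images.
+    mask: (B, L) boolean tensor, True where image A's tokens are masked out.
+    The mixed sequence is encoded once; dual reconstruction then recovers
+    image A at the masked positions and image B at the remaining positions.
+    """
+    return torch.where(mask.unsqueeze(-1), tokens_b, tokens_a)
+
+
+# Toy usage: 2 samples, 49 patches, 8-dim embeddings, 50% masking ratio.
+a, b = torch.randn(2, 49, 8), torch.randn(2, 49, 8)
+mask = torch.rand(2, 49) < 0.5
+mixed = mix_image_tokens(a, b, mask)
+print(mixed.shape)  # torch.Size([2, 49, 8])
+```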
+
+
+

+
+
+## How to use it?
+
+
+
+**Predict image**
+
+```python
+from mmpretrain import inference_model
+
+predict = inference_model('mixmim-base_mixmim-pre_8xb128-coslr-100e_in1k', 'demo/bird.JPEG')
+print(predict['pred_class'])
+print(predict['pred_score'])
+```
+
+**Use the model**
+
+```python
+import torch
+from mmpretrain import get_model
+
+model = get_model('mixmim_mixmim-base_16xb128-coslr-300e_in1k', pretrained=True)
+inputs = torch.rand(1, 3, 224, 224)
+out = model(inputs)
+print(type(out))
+# To extract features.
+feats = model.extract_feat(inputs)
+print(type(feats))
+```
+
+**Train/Test Command**
+
+Prepare your dataset according to the [docs](https://mmpretrain.readthedocs.io/en/latest/user_guides/dataset_prepare.html#prepare-dataset).
+
+Train:
+
+```shell
+python tools/train.py configs/mixmim/mixmim_mixmim-base_16xb128-coslr-300e_in1k.py
+```
+
+Test:
+
+```shell
+python tools/test.py configs/mixmim/benchmarks/mixmim-base_8xb128-coslr-100e_in1k.py https://download.openmmlab.com/mmselfsup/1.x/mixmim/mixmim-base-p16_16xb128-coslr-300e_in1k/mixmim-base-p16_ft-8xb128-coslr-100e_in1k/mixmim-base-p16_ft-8xb128-coslr-100e_in1k_20221208-41ecada9.pth
+```
+
+
+
+## Models and results
+
+### Pretrained models
+
+| Model | Params (M) | Flops (G) | Config | Download |
+| :------------------------------------------- | :--------: | :-------: | :-----------------------------------------------------: | :--------------------------------------------------------------------------------: |
+| `mixmim_mixmim-base_16xb128-coslr-300e_in1k` | 114.67 | 16.35 | [config](mixmim_mixmim-base_16xb128-coslr-300e_in1k.py) | [model](https://download.openmmlab.com/mmselfsup/1.x/mixmim/mixmim-base-p16_16xb128-coslr-300e_in1k/mixmim-base-p16_16xb128-coslr-300e_in1k_20221208-44fe8d2c.pth) \| [log](https://download.openmmlab.com/mmselfsup/1.x/mixmim/mixmim-base-p16_16xb128-coslr-300e_in1k/mixmim-base-p16_16xb128-coslr-300e_in1k_20221208-44fe8d2c.json) |
+
+### Image Classification on ImageNet-1k
+
+| Model | Pretrain | Params (M) | Flops (G) | Top-1 (%) | Config | Download |
+| :---------------------------------------- | :------------------------------------------: | :--------: | :-------: | :-------: | :----------------------------------------: | :-------------------------------------------: |
+| `mixmim-base_mixmim-pre_8xb128-coslr-100e_in1k` | [MIXMIM](https://download.openmmlab.com/mmselfsup/1.x/mixmim/mixmim-base-p16_16xb128-coslr-300e_in1k/mixmim-base-p16_16xb128-coslr-300e_in1k_20221208-44fe8d2c.pth) | 88.34 | 16.35 | 84.63 | [config](benchmarks/mixmim-base_8xb128-coslr-100e_in1k.py) | [model](https://download.openmmlab.com/mmselfsup/1.x/mixmim/mixmim-base-p16_16xb128-coslr-300e_in1k/mixmim-base-p16_ft-8xb128-coslr-100e_in1k/mixmim-base-p16_ft-8xb128-coslr-100e_in1k_20221208-41ecada9.pth) \| [log](https://download.openmmlab.com/mmselfsup/1.x/mixmim/mixmim-base-p16_16xb128-coslr-300e_in1k/mixmim-base-p16_ft-8xb128-coslr-100e_in1k/mixmim-base-p16_ft-8xb128-coslr-100e_in1k_20221208-41ecada9.json) |
+
+## Citation
+
+```bibtex
+@article{MixMIM2022,
+ author = {Jihao Liu and Xin Huang and Yu Liu and Hongsheng Li},
+ journal = {arXiv:2205.13137},
+ title = {MixMIM: Mixed and Masked Image Modeling for Efficient Visual Representation Learning},
+ year = {2022},
+}
+```
diff --git a/configs/mixmim/benchmarks/mixmim-base_8xb128-coslr-100e_in1k.py b/configs/mixmim/benchmarks/mixmim-base_8xb128-coslr-100e_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..c48ee3b8b64e96490e4e9ceaaab5b2b5b1f3f3cc
--- /dev/null
+++ b/configs/mixmim/benchmarks/mixmim-base_8xb128-coslr-100e_in1k.py
@@ -0,0 +1,133 @@
+_base_ = [
+ '../../_base_/models/mixmim/mixmim_base.py',
+ '../../_base_/datasets/imagenet_bs64_swin_224.py',
+ '../../_base_/default_runtime.py'
+]
+
+# dataset settings
+dataset_type = 'ImageNet'
+data_root = 'data/imagenet/'
+
+data_preprocessor = dict(
+ mean=[123.675, 116.28, 103.53],
+ std=[58.395, 57.12, 57.375],
+ to_rgb=True,
+)
+
+bgr_mean = data_preprocessor['mean'][::-1]
+bgr_std = data_preprocessor['std'][::-1]
+
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='RandomResizedCrop',
+ scale=224,
+ backend='pillow',
+ interpolation='bicubic'),
+ dict(type='RandomFlip', prob=0.5, direction='horizontal'),
+ dict(
+ type='RandAugment',
+ policies='timm_increasing',
+ num_policies=2,
+ total_level=10,
+ magnitude_level=9,
+ magnitude_std=0.5,
+ hparams=dict(
+ pad_val=[round(x) for x in bgr_mean], interpolation='bicubic')),
+ dict(
+ type='RandomErasing',
+ erase_prob=0.25,
+ mode='rand',
+ min_area_ratio=0.02,
+ max_area_ratio=1 / 3,
+ fill_color=bgr_mean,
+ fill_std=bgr_std),
+ dict(type='PackInputs'),
+]
+
+train_dataloader = dict(
+ batch_size=128,
+ num_workers=16,
+ dataset=dict(
+ type=dataset_type,
+ data_root=data_root,
+ ann_file='meta/train.txt',
+ data_prefix='train',
+ pipeline=train_pipeline),
+ sampler=dict(type='DefaultSampler', shuffle=True),
+ persistent_workers=True,
+)
+
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='ResizeEdge',
+ scale=256,
+ edge='short',
+ backend='pillow',
+ interpolation='bicubic'),
+ dict(type='CenterCrop', crop_size=224),
+ dict(type='PackInputs'),
+]
+
+val_dataloader = dict(
+ batch_size=64,
+ num_workers=8,
+ pin_memory=True,
+ collate_fn=dict(type='default_collate'),
+ dataset=dict(
+ type=dataset_type,
+ data_root=data_root,
+ ann_file='meta/val.txt',
+ data_prefix='val',
+ pipeline=test_pipeline),
+ sampler=dict(type='DefaultSampler', shuffle=False),
+ persistent_workers=True,
+)
+test_dataloader = val_dataloader
+
+model = dict(
+ backbone=dict(
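+ # NOTE: `checkpoint` ships empty; point it at the pretrained MixMIM weights
+ # from the README (or override it via --cfg-options) before fine-tuning.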
+ init_cfg=dict(type='Pretrained', checkpoint='', prefix='backbone.')))
+
+# optimizer
+optim_wrapper = dict(
+ type='OptimWrapper',
+ optimizer=dict(
+ type='AdamW',
+ lr=5e-4 * (8 * 128 / 256),
+ betas=(0.9, 0.999),
+ weight_decay=0.05),
+ constructor='LearningRateDecayOptimWrapperConstructor',
+ paramwise_cfg=dict(
+ layer_decay_rate=0.7,
+ custom_keys={
+ '.ln': dict(decay_mult=0.0), # do not decay on ln and bias
+ '.bias': dict(decay_mult=0.0)
+ }))
+
+param_scheduler = [
+ dict(
+ type='LinearLR',
+ start_factor=1e-6,
+ by_epoch=True,
+ begin=0,
+ end=5,
+ convert_to_iter_based=True),
+ dict(
+ type='CosineAnnealingLR',
+ T_max=95,
+ eta_min=1e-6,
+ by_epoch=True,
+ begin=5,
+ end=100,
+ convert_to_iter_based=True)
+]
+
+train_cfg = dict(by_epoch=True, max_epochs=100, val_interval=10)
+val_cfg = dict()
+test_cfg = dict()
+
+default_hooks = dict(
+ # save checkpoint per epoch.
+ checkpoint=dict(type='CheckpointHook', interval=1, max_keep_ckpts=1))
diff --git a/configs/mixmim/benchmarks/mixmim-base_8xb64_in1k.py b/configs/mixmim/benchmarks/mixmim-base_8xb64_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..86ada85f4ef1e7934e44b4f044ff9d9adf88f782
--- /dev/null
+++ b/configs/mixmim/benchmarks/mixmim-base_8xb64_in1k.py
@@ -0,0 +1,6 @@
+_base_ = [
+ '../../_base_/models/mixmim/mixmim_base.py',
+ '../../_base_/datasets/imagenet_bs64_swin_224.py',
+ '../../_base_/schedules/imagenet_bs256.py',
+ '../../_base_/default_runtime.py'
+]
diff --git a/configs/mixmim/metafile.yml b/configs/mixmim/metafile.yml
new file mode 100644
index 0000000000000000000000000000000000000000..5bf87bda937f5091629c89143fd997cad0deb132
--- /dev/null
+++ b/configs/mixmim/metafile.yml
@@ -0,0 +1,51 @@
+Collections:
+ - Name: MixMIM
+ Metadata:
+ Architecture:
+ - Attention Dropout
+ - Convolution
+ - Dense Connections
+ - Dropout
+ - GELU
+ - Layer Normalization
+ - Multi-Head Attention
+ - Scaled Dot-Product Attention
+ - Tanh Activation
+ Paper:
+ Title: 'MixMIM: Mixed and Masked Image Modeling for Efficient Visual Representation
+ Learning'
+ URL: https://arxiv.org/abs/2205.13137
+ README: configs/mixmim/README.md
+ Code:
+ URL: https://github.com/open-mmlab/mmpretrain/blob/main/mmpretrain/models/backbones/mixmim.py
+ Version: v1.0.0rc4
+
+Models:
+ - Name: mixmim_mixmim-base_16xb128-coslr-300e_in1k
+ Metadata:
+ Epochs: 300
+ Batch Size: 2048
+ FLOPs: 16351906816
+ Parameters: 114665784
+ Training Data: ImageNet-1k
+ In Collection: MixMIM
+ Results: null
+ Weights: https://download.openmmlab.com/mmselfsup/1.x/mixmim/mixmim-base-p16_16xb128-coslr-300e_in1k/mixmim-base-p16_16xb128-coslr-300e_in1k_20221208-44fe8d2c.pth
+ Config: configs/mixmim/mixmim_mixmim-base_16xb128-coslr-300e_in1k.py
+ Downstream:
+ - mixmim-base_mixmim-pre_8xb128-coslr-100e_in1k
+ - Name: mixmim-base_mixmim-pre_8xb128-coslr-100e_in1k
+ Metadata:
+ Epochs: 100
+ Batch Size: 1024
+ FLOPs: 16351906816
+ Parameters: 88344352
+ Training Data: ImageNet-1k
+ In Collection: MixMIM
+ Results:
+ - Task: Image Classification
+ Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 84.63
+ Weights: https://download.openmmlab.com/mmselfsup/1.x/mixmim/mixmim-base-p16_16xb128-coslr-300e_in1k/mixmim-base-p16_ft-8xb128-coslr-100e_in1k/mixmim-base-p16_ft-8xb128-coslr-100e_in1k_20221208-41ecada9.pth
+ Config: configs/mixmim/benchmarks/mixmim-base_8xb128-coslr-100e_in1k.py
diff --git a/configs/mixmim/mixmim_mixmim-base_16xb128-coslr-300e_in1k.py b/configs/mixmim/mixmim_mixmim-base_16xb128-coslr-300e_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..29b94eaea311767a7fe91c47753680e5af6d0181
--- /dev/null
+++ b/configs/mixmim/mixmim_mixmim-base_16xb128-coslr-300e_in1k.py
@@ -0,0 +1,98 @@
+_base_ = '../_base_/default_runtime.py'
+
+# dataset settings
+dataset_type = 'ImageNet'
+data_root = 'data/imagenet/'
+data_preprocessor = dict(
+ type='SelfSupDataPreprocessor',
+ mean=[123.675, 116.28, 103.53],
+ std=[58.395, 57.12, 57.375],
+ to_rgb=True)
+
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='RandomResizedCrop',
+ scale=224,
+ crop_ratio_range=(0.2, 1.0),
+ backend='pillow',
+ interpolation='bicubic'),
+ dict(type='RandomFlip', prob=0.5),
+ dict(type='PackInputs')
+]
+
+train_dataloader = dict(
+ batch_size=128,
+ num_workers=8,
+ persistent_workers=True,
+ pin_memory=True,
+ sampler=dict(type='DefaultSampler', shuffle=True),
+ collate_fn=dict(type='default_collate'),
+ dataset=dict(
+ type=dataset_type,
+ data_root=data_root,
+ ann_file='meta/train.txt',
+ data_prefix=dict(img_path='train/'),
+ pipeline=train_pipeline))
+
+# model settings
+model = dict(
+ type='MixMIM',
+ backbone=dict(
+ type='MixMIMPretrainTransformer',
+ arch='B',
+ drop_rate=0.0,
+ drop_path_rate=0.0, # drop_path_rate=0.0 during pretraining
+ mask_ratio=0.5),
+ neck=dict(
+ type='MixMIMPretrainDecoder',
+ num_patches=49,
+ encoder_stride=32,
+ embed_dim=1024,
+ decoder_embed_dim=512,
+ decoder_depth=8,
+ decoder_num_heads=16),
+ head=dict(
+ type='MixMIMPretrainHead',
+ norm_pix=True,
+ loss=dict(type='PixelReconstructionLoss', criterion='L2')))
+
+# optimizer wrapper
+optim_wrapper = dict(
+ type='OptimWrapper',
+ optimizer=dict(
+ type='AdamW',
+ lr=1.5e-4 * (2048 / 256),
+ betas=(0.9, 0.95),
+ weight_decay=0.05),
+ paramwise_cfg=dict(custom_keys={
+ 'ln': dict(decay_mult=0.0),
+ 'bias': dict(decay_mult=0.0)
+ }))
+
+param_scheduler = [
+ dict(
+ type='LinearLR',
+ start_factor=1e-4,
+ by_epoch=True,
+ begin=0,
+ end=40,
+ convert_to_iter_based=True),
+ dict(
+ type='CosineAnnealingLR',
+ T_max=260,
+ by_epoch=True,
+ begin=40,
+ end=300,
+ convert_to_iter_based=True)
+]
+
+train_cfg = dict(type='EpochBasedTrainLoop', max_epochs=300)
+default_hooks = dict(
+ checkpoint=dict(type='CheckpointHook', interval=10, max_keep_ckpts=1))
+
+randomness = dict(seed=0, diff_rank_seed=True)
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR
+# based on the actual training batch size.
+auto_scale_lr = dict(base_batch_size=2048)
diff --git a/configs/mlp_mixer/README.md b/configs/mlp_mixer/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..f0bb4ce0984627f9dafe2f86910348cc20a8a0a7
--- /dev/null
+++ b/configs/mlp_mixer/README.md
@@ -0,0 +1,78 @@
+# MLP-Mixer
+
+> [MLP-Mixer: An all-MLP Architecture for Vision](https://arxiv.org/abs/2105.01601)
+
+
+
+## Abstract
+
+Convolutional Neural Networks (CNNs) are the go-to model for computer vision. Recently, attention-based networks, such as the Vision Transformer, have also become popular. In this paper we show that while convolutions and attention are both sufficient for good performance, neither of them are necessary. We present MLP-Mixer, an architecture based exclusively on multi-layer perceptrons (MLPs). MLP-Mixer contains two types of layers: one with MLPs applied independently to image patches (i.e. "mixing" the per-location features), and one with MLPs applied across patches (i.e. "mixing" spatial information). When trained on large datasets, or with modern regularization schemes, MLP-Mixer attains competitive scores on image classification benchmarks, with pre-training and inference cost comparable to state-of-the-art models. We hope that these results spark further research beyond the realms of well established CNNs and Transformers.
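+
+The two layer types can be summarised in a compact sketch (a simplified illustration of the idea only, not the backbone implementation referenced by this config; the class name and hidden sizes are assumptions):
+
+```python
+import torch
+import torch.nn as nn
+
+
+class MixerBlock(nn.Module):
+    """One Mixer block: a token-mixing MLP across patches, then a channel-mixing MLP per patch."""
+
+    def __init__(self, num_patches, channels, tokens_hidden, channels_hidden):
+        super().__init__()
+        self.norm1 = nn.LayerNorm(channels)
+        # Mixes information *across* patches (applied along the patch dimension).
+        self.token_mlp = nn.Sequential(
+            nn.Linear(num_patches, tokens_hidden), nn.GELU(),
+            nn.Linear(tokens_hidden, num_patches))
+        self.norm2 = nn.LayerNorm(channels)
+        # Mixes information *within* each patch (applied along the channel dimension).
+        self.channel_mlp = nn.Sequential(
+            nn.Linear(channels, channels_hidden), nn.GELU(),
+            nn.Linear(channels_hidden, channels))
+
+    def forward(self, x):  # x: (B, num_patches, channels)
+        x = x + self.token_mlp(self.norm1(x).transpose(1, 2)).transpose(1, 2)
+        x = x + self.channel_mlp(self.norm2(x))
+        return x
+
+
+block = MixerBlock(num_patches=196, channels=768, tokens_hidden=384, channels_hidden=3072)
+print(block(torch.randn(1, 196, 768)).shape)  # torch.Size([1, 196, 768])
+```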
+
+
+

+
+
+## How to use it?
+
+
+
+**Predict image**
+
+```python
+from mmpretrain import inference_model
+
+predict = inference_model('mlp-mixer-base-p16_3rdparty_64xb64_in1k', 'demo/bird.JPEG')
+print(predict['pred_class'])
+print(predict['pred_score'])
+```
+
+**Use the model**
+
+```python
+import torch
+from mmpretrain import get_model
+
+model = get_model('mlp-mixer-base-p16_3rdparty_64xb64_in1k', pretrained=True)
+inputs = torch.rand(1, 3, 224, 224)
+out = model(inputs)
+print(type(out))
+# To extract features.
+feats = model.extract_feat(inputs)
+print(type(feats))
+```
+
+**Test Command**
+
+Prepare your dataset according to the [docs](https://mmpretrain.readthedocs.io/en/latest/user_guides/dataset_prepare.html#prepare-dataset).
+
+Test:
+
+```shell
+python tools/test.py configs/mlp_mixer/mlp-mixer-base-p16_64xb64_in1k.py https://download.openmmlab.com/mmclassification/v0/mlp-mixer/mixer-base-p16_3rdparty_64xb64_in1k_20211124-1377e3e0.pth
+```
+
+
+
+## Models and results
+
+### Image Classification on ImageNet-1k
+
+| Model | Pretrain | Params (M) | Flops (G) | Top-1 (%) | Top-5 (%) | Config | Download |
+| :------------------------------------------- | :----------: | :--------: | :-------: | :-------: | :-------: | :------------------------------------------: | :-------------------------------------------------------------: |
+| `mlp-mixer-base-p16_3rdparty_64xb64_in1k`\* | From scratch | 59.88 | 12.61 | 76.68 | 92.25 | [config](mlp-mixer-base-p16_64xb64_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/mlp-mixer/mixer-base-p16_3rdparty_64xb64_in1k_20211124-1377e3e0.pth) |
+| `mlp-mixer-large-p16_3rdparty_64xb64_in1k`\* | From scratch | 208.20 | 44.57 | 72.34 | 88.02 | [config](mlp-mixer-large-p16_64xb64_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/mlp-mixer/mixer-large-p16_3rdparty_64xb64_in1k_20211124-5a2519d2.pth) |
+
+*Models with * are converted from [timm](https://github.com/rwightman/pytorch-image-models/blob/master/timm/models/mlp_mixer.py). The config files of these models are only for inference. We haven't reproduced the training results.*
+
+## Citation
+
+```bibtex
+@misc{tolstikhin2021mlpmixer,
+ title={MLP-Mixer: An all-MLP Architecture for Vision},
+ author={Ilya Tolstikhin and Neil Houlsby and Alexander Kolesnikov and Lucas Beyer and Xiaohua Zhai and Thomas Unterthiner and Jessica Yung and Andreas Steiner and Daniel Keysers and Jakob Uszkoreit and Mario Lucic and Alexey Dosovitskiy},
+ year={2021},
+ eprint={2105.01601},
+ archivePrefix={arXiv},
+ primaryClass={cs.CV}
+}
+```
diff --git a/configs/mlp_mixer/metafile.yml b/configs/mlp_mixer/metafile.yml
new file mode 100644
index 0000000000000000000000000000000000000000..8b632db100373b10ad7653ed9e0302fa37013ee4
--- /dev/null
+++ b/configs/mlp_mixer/metafile.yml
@@ -0,0 +1,50 @@
+Collections:
+ - Name: MLP-Mixer
+ Metadata:
+ Training Data: ImageNet-1k
+ Architecture:
+ - MLP
+ - Layer Normalization
+ - Dropout
+ Paper:
+ URL: https://arxiv.org/abs/2105.01601
+ Title: "MLP-Mixer: An all-MLP Architecture for Vision"
+ README: configs/mlp_mixer/README.md
+ Code:
+ URL: https://github.com/open-mmlab/mmpretrain/blob/v0.18.0/mmcls/models/backbones/mlp_mixer.py
+ Version: v0.18.0
+
+Models:
+ - Name: mlp-mixer-base-p16_3rdparty_64xb64_in1k
+ In Collection: MLP-Mixer
+ Config: configs/mlp_mixer/mlp-mixer-base-p16_64xb64_in1k.py
+ Metadata:
+ FLOPs: 12610000000 # 12.61 G
+ Parameters: 59880000 # 59.88 M
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 76.68
+ Top 5 Accuracy: 92.25
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/mlp-mixer/mixer-base-p16_3rdparty_64xb64_in1k_20211124-1377e3e0.pth
+ Converted From:
+ Weights: https://github.com/rwightman/pytorch-image-models/releases/download/v0.1-vitjx/jx_mixer_b16_224-76587d61.pth
+ Code: https://github.com/rwightman/pytorch-image-models/blob/master/timm/models/mlp_mixer.py#L70
+
+ - Name: mlp-mixer-large-p16_3rdparty_64xb64_in1k
+ In Collection: MLP-Mixer
+ Config: configs/mlp_mixer/mlp-mixer-large-p16_64xb64_in1k.py
+ Metadata:
+ FLOPs: 44570000000 # 44.57 G
+ Parameters: 208200000 # 208.2 M
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 72.34
+ Top 5 Accuracy: 88.02
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/mlp-mixer/mixer-large-p16_3rdparty_64xb64_in1k_20211124-5a2519d2.pth
+ Converted From:
+ Weights: https://github.com/rwightman/pytorch-image-models/releases/download/v0.1-vitjx/jx_mixer_b16_224_in21k-617b3de2.pth
+ Code: https://github.com/rwightman/pytorch-image-models/blob/master/timm/models/mlp_mixer.py#L73
diff --git a/configs/mlp_mixer/mlp-mixer-base-p16_64xb64_in1k.py b/configs/mlp_mixer/mlp-mixer-base-p16_64xb64_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..bbf4268d3c6121be57d48e8577f3edebde05114b
--- /dev/null
+++ b/configs/mlp_mixer/mlp-mixer-base-p16_64xb64_in1k.py
@@ -0,0 +1,8 @@
+_base_ = [
+ '../_base_/models/mlp_mixer_base_patch16.py',
+ '../_base_/datasets/imagenet_bs64_mixer_224.py',
+ '../_base_/schedules/imagenet_bs4096_AdamW.py',
+ '../_base_/default_runtime.py',
+]
+
+optim_wrapper = dict(clip_grad=dict(max_norm=1.0))
diff --git a/configs/mlp_mixer/mlp-mixer-large-p16_64xb64_in1k.py b/configs/mlp_mixer/mlp-mixer-large-p16_64xb64_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..4fbe9c5c9ebc70ee1b718e904af1bc49fb6d3c78
--- /dev/null
+++ b/configs/mlp_mixer/mlp-mixer-large-p16_64xb64_in1k.py
@@ -0,0 +1,8 @@
+_base_ = [
+ '../_base_/models/mlp_mixer_large_patch16.py',
+ '../_base_/datasets/imagenet_bs64_mixer_224.py',
+ '../_base_/schedules/imagenet_bs4096_AdamW.py',
+ '../_base_/default_runtime.py',
+]
+
+optim_wrapper = dict(clip_grad=dict(max_norm=1.0))
diff --git a/configs/mobilenet_v2/README.md b/configs/mobilenet_v2/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..74548e19698ead42fd7cfb86f8a7c04fbee7f022
--- /dev/null
+++ b/configs/mobilenet_v2/README.md
@@ -0,0 +1,97 @@
+# MobileNet V2
+
+> [MobileNetV2: Inverted Residuals and Linear Bottlenecks](https://arxiv.org/abs/1801.04381)
+
+
+
+## Introduction
+
+**MobileNet V2** is initially described in [the paper](https://arxiv.org/pdf/1801.04381.pdf), which improves the state-of-the-art performance of mobile models on multiple tasks. MobileNetV2 is an improvement on V1: its key new ideas are the linear bottleneck and the inverted residual, a structure in which the input and output of the residual block are thin bottleneck layers. The intermediate expansion layer uses lightweight depthwise convolutions to filter features as a source of non-linearity. The authors of MobileNet V2 measure its performance on ImageNet classification, COCO object detection, and VOC image segmentation.
+
+
+

+
+
+## Abstract
+
+In this paper we describe a new mobile architecture, MobileNetV2, that improves the state of the art performance of mobile models on multiple tasks and benchmarks as well as across a spectrum of different model sizes. We also describe efficient ways of applying these mobile models to object detection in a novel framework we call SSDLite. Additionally, we demonstrate how to build mobile semantic segmentation models through a reduced form of DeepLabv3 which we call Mobile DeepLabv3.
+
+The MobileNetV2 architecture is based on an inverted residual structure where the input and output of the residual block are thin bottleneck layers opposite to traditional residual models which use expanded representations in the input. MobileNetV2 uses lightweight depthwise convolutions to filter features in the intermediate expansion layer. Additionally, we find that it is important to remove non-linearities in the narrow layers in order to maintain representational power. We demonstrate that this improves performance and provide an intuition that led to this design. Finally, our approach allows decoupling of the input/output domains from the expressiveness of the transformation, which provides a convenient framework for further analysis. We measure our performance on ImageNet classification, COCO object detection, VOC image segmentation. We evaluate the trade-offs between accuracy, and number of operations measured by multiply-adds (MAdd), as well as the number of parameters.
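+
+The inverted residual block described above can be sketched as follows (a minimal illustration only, not the MobileNetV2 backbone implementation referenced by this config; layer widths and the class name are assumptions):
+
+```python
+import torch
+import torch.nn as nn
+
+
+class InvertedResidual(nn.Module):
+    """Expand (1x1) -> depthwise (3x3) -> linear bottleneck (1x1), with a
+    residual connection when the block keeps resolution and width."""
+
+    def __init__(self, in_ch, out_ch, stride=1, expand_ratio=6):
+        super().__init__()
+        hidden = in_ch * expand_ratio
+        self.use_residual = stride == 1 and in_ch == out_ch
+        self.layers = nn.Sequential(
+            # 1x1 pointwise expansion to a wider representation.
+            nn.Conv2d(in_ch, hidden, 1, bias=False),
+            nn.BatchNorm2d(hidden), nn.ReLU6(inplace=True),
+            # Lightweight 3x3 depthwise convolution in the expanded space.
+            nn.Conv2d(hidden, hidden, 3, stride, 1, groups=hidden, bias=False),
+            nn.BatchNorm2d(hidden), nn.ReLU6(inplace=True),
+            # 1x1 linear bottleneck projection: no non-linearity here.
+            nn.Conv2d(hidden, out_ch, 1, bias=False),
+            nn.BatchNorm2d(out_ch))
+
+    def forward(self, x):
+        out = self.layers(x)
+        return x + out if self.use_residual else out
+
+
+block = InvertedResidual(32, 32)
+print(block(torch.randn(1, 32, 56, 56)).shape)  # torch.Size([1, 32, 56, 56])
+```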
+
+
+
+
+## How to use it?
+
+
+
+**Predict image**
+
+```python
+from mmpretrain import inference_model
+
+predict = inference_model('mobilenet-v2_8xb32_in1k', 'demo/bird.JPEG')
+print(predict['pred_class'])
+print(predict['pred_score'])
+```
+
+**Use the model**
+
+```python
+import torch
+from mmpretrain import get_model
+
+model = get_model('mobilenet-v2_8xb32_in1k', pretrained=True)
+inputs = torch.rand(1, 3, 224, 224)
+out = model(inputs)
+print(type(out))
+# To extract features.
+feats = model.extract_feat(inputs)
+print(type(feats))
+```
+
+**Train/Test Command**
+
+Prepare your dataset according to the [docs](https://mmpretrain.readthedocs.io/en/latest/user_guides/dataset_prepare.html#prepare-dataset).
+
+Train:
+
+```shell
+python tools/train.py configs/mobilenet_v2/mobilenet-v2_8xb32_in1k.py
+```
+
+Test:
+
+```shell
+python tools/test.py configs/mobilenet_v2/mobilenet-v2_8xb32_in1k.py https://download.openmmlab.com/mmclassification/v0/mobilenet_v2/mobilenet_v2_batch256_imagenet_20200708-3b2dc3af.pth
+```
+
+
+
+## Models and results
+
+### Image Classification on ImageNet-1k
+
+| Model | Pretrain | Params (M) | Flops (G) | Top-1 (%) | Top-5 (%) | Config | Download |
+| :------------------------ | :----------: | :--------: | :-------: | :-------: | :-------: | :----------------------------------: | :----------------------------------------------------------------------------------------: |
+| `mobilenet-v2_8xb32_in1k` | From scratch | 3.50 | 0.32 | 71.86 | 90.42 | [config](mobilenet-v2_8xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/mobilenet_v2/mobilenet_v2_batch256_imagenet_20200708-3b2dc3af.pth) \| [log](https://download.openmmlab.com/mmclassification/v0/mobilenet_v2/mobilenet_v2_batch256_imagenet_20200708-3b2dc3af.json) |
+
+## Citation
+
+```bibtex
+@INPROCEEDINGS{8578572,
+ author={M. {Sandler} and A. {Howard} and M. {Zhu} and A. {Zhmoginov} and L. {Chen}},
+ booktitle={2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition},
+ title={MobileNetV2: Inverted Residuals and Linear Bottlenecks},
+ year={2018},
+ volume={},
+ number={},
+ pages={4510-4520},
+ doi={10.1109/CVPR.2018.00474}
+}
+```
diff --git a/configs/mobilenet_v2/metafile.yml b/configs/mobilenet_v2/metafile.yml
new file mode 100644
index 0000000000000000000000000000000000000000..aaa490ae485e87c3965f946f3fe25aa52919830b
--- /dev/null
+++ b/configs/mobilenet_v2/metafile.yml
@@ -0,0 +1,34 @@
+Collections:
+ - Name: MobileNet V2
+ Metadata:
+ Training Data: ImageNet-1k
+ Training Techniques:
+ - SGD with Momentum
+ - Weight Decay
+ Training Resources: 8x V100 GPUs
+ Epochs: 300
+ Batch Size: 256
+ Architecture:
+ - MobileNet V2
+ Paper:
+ URL: https://arxiv.org/abs/1801.04381
+ Title: "MobileNetV2: Inverted Residuals and Linear Bottlenecks"
+ README: configs/mobilenet_v2/README.md
+ Code:
+ URL: https://github.com/open-mmlab/mmpretrain/blob/v0.15.0/mmcls/models/backbones/mobilenet_v2.py#L101
+ Version: v0.15.0
+
+Models:
+ - Name: mobilenet-v2_8xb32_in1k
+ Metadata:
+ FLOPs: 319000000
+ Parameters: 3500000
+ In Collection: MobileNet V2
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 71.86
+ Top 5 Accuracy: 90.42
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/mobilenet_v2/mobilenet_v2_batch256_imagenet_20200708-3b2dc3af.pth
+ Config: configs/mobilenet_v2/mobilenet-v2_8xb32_in1k.py
diff --git a/configs/mobilenet_v2/mobilenet-v2_8xb32_in1k.py b/configs/mobilenet_v2/mobilenet-v2_8xb32_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..afd2d9795af601010833ba239465c3e2c5abdf20
--- /dev/null
+++ b/configs/mobilenet_v2/mobilenet-v2_8xb32_in1k.py
@@ -0,0 +1,6 @@
+_base_ = [
+ '../_base_/models/mobilenet_v2_1x.py',
+ '../_base_/datasets/imagenet_bs32_pil_resize.py',
+ '../_base_/schedules/imagenet_bs256_epochstep.py',
+ '../_base_/default_runtime.py'
+]
diff --git a/configs/mobilenet_v3/README.md b/configs/mobilenet_v3/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..833de5b25aae9a8af43f5e086e6e2fd212669d03
--- /dev/null
+++ b/configs/mobilenet_v3/README.md
@@ -0,0 +1,99 @@
+# MobileNet V3
+
+> [Searching for MobileNetV3](https://arxiv.org/abs/1905.02244)
+
+
+
+## Introduction
+
+**MobileNet V3** is initially described in [the paper](https://arxiv.org/pdf/1905.02244.pdf). Its architecture is obtained by network architecture search (NAS), inheriting the practical building blocks of V1 and V2 and adding the squeeze-and-excitation (SE) channel attention mechanism. The authors create two new MobileNet models for release: MobileNetV3-Large and MobileNetV3-Small, which are targeted at high- and low-resource use cases respectively. These models are then adapted and applied to the tasks of object detection and semantic segmentation. The authors of MobileNet V3 measure its performance on ImageNet classification, COCO object detection, and Cityscapes segmentation.
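+
+The squeeze-and-excitation (SE) channel attention mentioned above can be sketched in a few lines (an illustration only, not the MobileNet V3 backbone implementation referenced by this config; the reduction ratio and gating activation are assumptions):
+
+```python
+import torch
+import torch.nn as nn
+
+
+class SqueezeExcite(nn.Module):
+    """Squeeze spatial information into per-channel statistics, then re-weight channels."""
+
+    def __init__(self, channels, reduction=4):
+        super().__init__()
+        self.pool = nn.AdaptiveAvgPool2d(1)  # squeeze: global average pooling
+        self.fc = nn.Sequential(
+            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
+            nn.Linear(channels // reduction, channels), nn.Hardsigmoid())
+
+    def forward(self, x):  # x: (B, C, H, W)
+        b, c, _, _ = x.shape
+        weights = self.fc(self.pool(x).view(b, c)).view(b, c, 1, 1)
+        return x * weights  # excite: per-channel re-weighting
+
+
+se = SqueezeExcite(channels=64)
+print(se(torch.randn(1, 64, 28, 28)).shape)  # torch.Size([1, 64, 28, 28])
+```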
+
+
+

+
+
+## Abstract
+
+We present the next generation of MobileNets based on a combination of complementary search techniques as well as a novel architecture design. MobileNetV3 is tuned to mobile phone CPUs through a combination of hardware-aware network architecture search (NAS) complemented by the NetAdapt algorithm and then subsequently improved through novel architecture advances. This paper starts the exploration of how automated search algorithms and network design can work together to harness complementary approaches improving the overall state of the art. Through this process we create two new MobileNet models for release: MobileNetV3-Large and MobileNetV3-Small which are targeted for high and low resource use cases. These models are then adapted and applied to the tasks of object detection and semantic segmentation. For the task of semantic segmentation (or any dense pixel prediction), we propose a new efficient segmentation decoder Lite Reduced Atrous Spatial Pyramid Pooling (LR-ASPP). We achieve new state of the art results for mobile classification, detection and segmentation. MobileNetV3-Large is 3.2% more accurate on ImageNet classification while reducing latency by 15% compared to MobileNetV2. MobileNetV3-Small is 4.6% more accurate while reducing latency by 5% compared to MobileNetV2. MobileNetV3-Large detection is 25% faster at roughly the same accuracy as MobileNetV2 on COCO detection. MobileNetV3-Large LR-ASPP is 30% faster than MobileNetV2 R-ASPP at similar accuracy for Cityscapes segmentation.
+
+
+
+
+## How to use it?
+
+
+
+**Predict image**
+
+```python
+from mmpretrain import inference_model
+
+predict = inference_model('mobilenet-v3-small-050_3rdparty_in1k', 'demo/bird.JPEG')
+print(predict['pred_class'])
+print(predict['pred_score'])
+```
+
+**Use the model**
+
+```python
+import torch
+from mmpretrain import get_model
+
+model = get_model('mobilenet-v3-small-050_3rdparty_in1k', pretrained=True)
+inputs = torch.rand(1, 3, 224, 224)
+out = model(inputs)
+print(type(out))
+# To extract features.
+feats = model.extract_feat(inputs)
+print(type(feats))
+```
+
+**Train/Test Command**
+
+Prepare your dataset according to the [docs](https://mmpretrain.readthedocs.io/en/latest/user_guides/dataset_prepare.html#prepare-dataset).
+
+Train:
+
+```shell
+python tools/train.py configs/mobilenet_v3/mobilenet-v3-small_8xb128_in1k.py
+```
+
+Test:
+
+```shell
+python tools/test.py configs/mobilenet_v3/mobilenet-v3-small-050_8xb128_in1k.py https://download.openmmlab.com/mmclassification/v0/mobilenet_v3/mobilenet-v3-small-050_3rdparty_in1k_20221114-e0b86be1.pth
+```
+
+
+
+## Models and results
+
+### Image Classification on ImageNet-1k
+
+| Model | Pretrain | Params (M) | Flops (G) | Top-1 (%) | Top-5 (%) | Config | Download |
+| :--------------------------------------- | :----------: | :--------: | :-------: | :-------: | :-------: | :---------------------------------------------: | :--------------------------------------------------------------: |
+| `mobilenet-v3-small-050_3rdparty_in1k`\* | From scratch | 1.59 | 0.02 | 57.91 | 80.19 | [config](mobilenet-v3-small-050_8xb128_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/mobilenet_v3/mobilenet-v3-small-050_3rdparty_in1k_20221114-e0b86be1.pth) |
+| `mobilenet-v3-small-075_3rdparty_in1k`\* | From scratch | 2.04 | 0.04 | 65.23 | 85.44 | [config](mobilenet-v3-small-075_8xb128_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/mobilenet_v3/mobilenet-v3-small-075_3rdparty_in1k_20221114-2011fa76.pth) |
+| `mobilenet-v3-small_8xb128_in1k` | From scratch | 2.54 | 0.06 | 66.68 | 86.74 | [config](mobilenet-v3-small_8xb128_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/mobilenet_v3/mobilenet-v3-small_8xb128_in1k_20221114-bd1bfcde.pth) \| [log](https://download.openmmlab.com/mmclassification/v0/mobilenet_v3/mobilenet-v3-small_8xb128_in1k_20221114-bd1bfcde.json) |
+| `mobilenet-v3-small_3rdparty_in1k`\* | From scratch | 2.54 | 0.06 | 67.66 | 87.41 | [config](mobilenet-v3-small_8xb128_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/mobilenet_v3/convert/mobilenet_v3_small-8427ecf0.pth) |
+| `mobilenet-v3-large_8xb128_in1k` | From scratch | 5.48 | 0.23 | 73.49 | 91.31 | [config](mobilenet-v3-large_8xb128_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/mobilenet_v3/mobilenet-v3-large_8xb128_in1k_20221114-0ed9ed9a.pth) \| [log](https://download.openmmlab.com/mmclassification/v0/mobilenet_v3/mobilenet-v3-large_8xb128_in1k_20221114-0ed9ed9a.json) |
+| `mobilenet-v3-large_3rdparty_in1k`\* | From scratch | 5.48 | 0.23 | 74.04 | 91.34 | [config](mobilenet-v3-large_8xb128_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/mobilenet_v3/convert/mobilenet_v3_large-3ea3c186.pth) |
+
+*Models with * are converted from the [official repo](https://github.com/pytorch/vision/blob/main/torchvision/models/mobilenetv3.py). The config files of these models are only for inference. We haven't reproduced the training results.*
+
+## Citation
+
+```bibtex
+@inproceedings{Howard_2019_ICCV,
+ author = {Howard, Andrew and Sandler, Mark and Chu, Grace and Chen, Liang-Chieh and Chen, Bo and Tan, Mingxing and Wang, Weijun and Zhu, Yukun and Pang, Ruoming and Vasudevan, Vijay and Le, Quoc V. and Adam, Hartwig},
+ title = {Searching for MobileNetV3},
+ booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
+ month = {October},
+ year = {2019}
+}
+```
diff --git a/configs/mobilenet_v3/metafile.yml b/configs/mobilenet_v3/metafile.yml
new file mode 100644
index 0000000000000000000000000000000000000000..53f1653682fa2af2155b786ee5a8f0be9c98698e
--- /dev/null
+++ b/configs/mobilenet_v3/metafile.yml
@@ -0,0 +1,111 @@
+Collections:
+ - Name: MobileNet V3
+ Metadata:
+ Training Data: ImageNet-1k
+ Training Techniques:
+ - RMSprop with Momentum
+ - Weight Decay
+ Training Resources: 8x V100 GPUs
+ Epochs: 600
+ Batch Size: 1024
+ Architecture:
+ - MobileNet V3
+ Paper:
+ URL: https://arxiv.org/abs/1905.02244
+ Title: Searching for MobileNetV3
+ README: configs/mobilenet_v3/README.md
+ Code:
+ URL: https://github.com/open-mmlab/mmpretrain/blob/v0.15.0/mmcls/models/backbones/mobilenet_v3.py
+ Version: v0.15.0
+
+Models:
+ - Name: mobilenet-v3-small-050_3rdparty_in1k
+ Metadata:
+ FLOPs: 24895000
+ Parameters: 1590000
+ In Collection: MobileNet V3
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 57.91
+ Top 5 Accuracy: 80.19
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/mobilenet_v3/mobilenet-v3-small-050_3rdparty_in1k_20221114-e0b86be1.pth
+ Config: configs/mobilenet_v3/mobilenet-v3-small-050_8xb128_in1k.py
+ Converted From:
+ Weights: https://github.com/rwightman/pytorch-image-models/releases/download/v0.1-weights/mobilenetv3_small_050_lambc-4b7bbe87.pth
+ Code: https://github.com/rwightman/pytorch-image-models/blob/main/timm/models/mobilenetv3.py
+ - Name: mobilenet-v3-small-075_3rdparty_in1k
+ Metadata:
+ FLOPs: 44791000
+ Parameters: 2040000
+ In Collection: MobileNet V3
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 65.23
+ Top 5 Accuracy: 85.44
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/mobilenet_v3/mobilenet-v3-small-075_3rdparty_in1k_20221114-2011fa76.pth
+ Config: configs/mobilenet_v3/mobilenet-v3-small-075_8xb128_in1k.py
+ Converted From:
+ Weights: https://github.com/rwightman/pytorch-image-models/releases/download/v0.1-weights/mobilenetv3_small_075_lambc-384766db.pth
+ Code: https://github.com/rwightman/pytorch-image-models/blob/main/timm/models/mobilenetv3.py
+ - Name: mobilenet-v3-small_8xb128_in1k
+ Metadata:
+ FLOPs: 60000000
+ Parameters: 2540000
+ In Collection: MobileNet V3
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 66.68
+ Top 5 Accuracy: 86.74
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/mobilenet_v3/mobilenet-v3-small_8xb128_in1k_20221114-bd1bfcde.pth
+ Config: configs/mobilenet_v3/mobilenet-v3-small_8xb128_in1k.py
+ - Name: mobilenet-v3-small_3rdparty_in1k
+ Metadata:
+ FLOPs: 60000000
+ Parameters: 2540000
+ In Collection: MobileNet V3
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 67.66
+ Top 5 Accuracy: 87.41
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/mobilenet_v3/convert/mobilenet_v3_small-8427ecf0.pth
+ Config: configs/mobilenet_v3/mobilenet-v3-small_8xb128_in1k.py
+ Converted From:
+ Weights: https://download.pytorch.org/models/mobilenet_v3_small-047dcff4.pth
+ Code: https://github.com/pytorch/vision/blob/main/torchvision/models/mobilenetv3.py
+ - Name: mobilenet-v3-large_8xb128_in1k
+ Metadata:
+ FLOPs: 230000000
+ Parameters: 5480000
+ In Collection: MobileNet V3
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 73.49
+ Top 5 Accuracy: 91.31
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/mobilenet_v3/mobilenet-v3-large_8xb128_in1k_20221114-0ed9ed9a.pth
+ Config: configs/mobilenet_v3/mobilenet-v3-large_8xb128_in1k.py
+ - Name: mobilenet-v3-large_3rdparty_in1k
+ Metadata:
+ FLOPs: 230000000
+ Parameters: 5480000
+ In Collection: MobileNet V3
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 74.04
+ Top 5 Accuracy: 91.34
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/mobilenet_v3/convert/mobilenet_v3_large-3ea3c186.pth
+ Config: configs/mobilenet_v3/mobilenet-v3-large_8xb128_in1k.py
+ Converted From:
+ Weights: https://download.pytorch.org/models/mobilenet_v3_large-8738ca79.pth
+ Code: https://github.com/pytorch/vision/blob/main/torchvision/models/mobilenetv3.py
diff --git a/configs/mobilenet_v3/mobilenet-v3-large_8xb128_in1k.py b/configs/mobilenet_v3/mobilenet-v3-large_8xb128_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..f5c05baf39f1cffdb9610d41b1603119a2edc727
--- /dev/null
+++ b/configs/mobilenet_v3/mobilenet-v3-large_8xb128_in1k.py
@@ -0,0 +1,28 @@
+# Refers to https://pytorch.org/blog/ml-models-torchvision-v0.9/#classification
+
+_base_ = [
+ '../_base_/models/mobilenet_v3/mobilenet_v3_large_imagenet.py',
+ '../_base_/datasets/imagenet_bs128_mbv3.py',
+ '../_base_/default_runtime.py',
+]
+
+# schedule settings
+optim_wrapper = dict(
+ optimizer=dict(
+ type='RMSprop',
+ lr=0.064,
+ alpha=0.9,
+ momentum=0.9,
+ eps=0.0316,
+ weight_decay=1e-5))
+
+param_scheduler = dict(type='StepLR', by_epoch=True, step_size=2, gamma=0.973)
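+# With gamma=0.973 applied every 2 epochs, the learning rate decays to about
+# 0.973**300 ~= 2.7e-4 of its initial value over the 600 training epochs.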
+
+train_cfg = dict(by_epoch=True, max_epochs=600, val_interval=1)
+val_cfg = dict()
+test_cfg = dict()
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR
+# based on the actual training batch size.
+# base_batch_size = (8 GPUs) x (128 samples per GPU)
+auto_scale_lr = dict(base_batch_size=1024)
diff --git a/configs/mobilenet_v3/mobilenet-v3-small-050_8xb128_in1k.py b/configs/mobilenet_v3/mobilenet-v3-small-050_8xb128_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..fc145625ca22f44ff48a6f4684589ab6833313e3
--- /dev/null
+++ b/configs/mobilenet_v3/mobilenet-v3-small-050_8xb128_in1k.py
@@ -0,0 +1,70 @@
+_base_ = [
+ '../_base_/models/mobilenet_v3/mobilenet_v3_small_050_imagenet.py',
+ '../_base_/datasets/imagenet_bs128_mbv3.py',
+ '../_base_/default_runtime.py',
+]
+
+# model settings
+model = dict(backbone=dict(norm_cfg=dict(type='BN', eps=1e-5, momentum=0.1)))
+
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='RandomResizedCrop',
+ scale=224,
+ backend='pillow',
+ interpolation='bicubic'),
+ dict(type='RandomFlip', prob=0.5, direction='horizontal'),
+ dict(
+ type='AutoAugment',
+ policies='imagenet',
+ hparams=dict(pad_val=[round(x) for x in [103.53, 116.28, 123.675]])),
+ dict(
+ type='RandomErasing',
+ erase_prob=0.2,
+ mode='rand',
+ min_area_ratio=0.02,
+ max_area_ratio=1 / 3,
+ fill_color=[103.53, 116.28, 123.675],
+ fill_std=[57.375, 57.12, 58.395]),
+ dict(type='PackInputs'),
+]
+
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='ResizeEdge',
+ scale=256,
+ edge='short',
+ backend='pillow',
+ interpolation='bicubic'),
+ dict(type='CenterCrop', crop_size=224),
+ dict(type='PackInputs'),
+]
+
+train_dataloader = dict(dataset=dict(pipeline=train_pipeline))
+
+val_dataloader = dict(dataset=dict(pipeline=test_pipeline))
+# If you want the standard test setting, please manually configure the test dataset
+test_dataloader = val_dataloader
+
+# schedule settings
+optim_wrapper = dict(
+ optimizer=dict(
+ type='RMSprop',
+ lr=0.064,
+ alpha=0.9,
+ momentum=0.9,
+ eps=0.0316,
+ weight_decay=1e-5))
+
+param_scheduler = dict(type='StepLR', by_epoch=True, step_size=2, gamma=0.973)
+
+train_cfg = dict(by_epoch=True, max_epochs=600, val_interval=10)
+val_cfg = dict()
+test_cfg = dict()
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR
+# based on the actual training batch size.
+# base_batch_size = (8 GPUs) x (128 samples per GPU)
+auto_scale_lr = dict(base_batch_size=1024)
diff --git a/configs/mobilenet_v3/mobilenet-v3-small-075_8xb128_in1k.py b/configs/mobilenet_v3/mobilenet-v3-small-075_8xb128_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..464b7cbd60e8b741f9765df091bfdadbfe1712a3
--- /dev/null
+++ b/configs/mobilenet_v3/mobilenet-v3-small-075_8xb128_in1k.py
@@ -0,0 +1,68 @@
+_base_ = [
+ '../_base_/models/mobilenet_v3/mobilenet_v3_small_075_imagenet.py',
+ '../_base_/datasets/imagenet_bs128_mbv3.py',
+ '../_base_/default_runtime.py',
+]
+
+# model settings
+model = dict(backbone=dict(norm_cfg=dict(type='BN', eps=1e-5, momentum=0.1)))
+
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='RandomResizedCrop',
+ scale=224,
+ backend='pillow',
+ interpolation='bicubic'),
+ dict(type='RandomFlip', prob=0.5, direction='horizontal'),
+ dict(
+ type='AutoAugment',
+ policies='imagenet',
+ hparams=dict(pad_val=[round(x) for x in [103.53, 116.28, 123.675]])),
+ dict(
+ type='RandomErasing',
+ erase_prob=0.2,
+ mode='rand',
+ min_area_ratio=0.02,
+ max_area_ratio=1 / 3,
+ fill_color=[103.53, 116.28, 123.675],
+ fill_std=[57.375, 57.12, 58.395]),
+ dict(type='PackInputs'),
+]
+
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='ResizeEdge',
+ scale=256,
+ edge='short',
+ backend='pillow',
+ interpolation='bicubic'),
+ dict(type='CenterCrop', crop_size=224),
+ dict(type='PackInputs'),
+]
+
+train_dataloader = dict(dataset=dict(pipeline=train_pipeline))
+val_dataloader = dict(dataset=dict(pipeline=test_pipeline))
+test_dataloader = val_dataloader
+
+# schedule settings
+optim_wrapper = dict(
+ optimizer=dict(
+ type='RMSprop',
+ lr=0.064,
+ alpha=0.9,
+ momentum=0.9,
+ eps=0.0316,
+ weight_decay=1e-5))
+
+param_scheduler = dict(type='StepLR', by_epoch=True, step_size=2, gamma=0.973)
+
+train_cfg = dict(by_epoch=True, max_epochs=600, val_interval=10)
+val_cfg = dict()
+test_cfg = dict()
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR
+# based on the actual training batch size.
+# base_batch_size = (8 GPUs) x (128 samples per GPU)
+auto_scale_lr = dict(base_batch_size=1024)
diff --git a/configs/mobilenet_v3/mobilenet-v3-small_8xb128_in1k.py b/configs/mobilenet_v3/mobilenet-v3-small_8xb128_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..06b0a328106611ced7ede94c0439f3e39d12f306
--- /dev/null
+++ b/configs/mobilenet_v3/mobilenet-v3-small_8xb128_in1k.py
@@ -0,0 +1,28 @@
+# Refers to https://pytorch.org/blog/ml-models-torchvision-v0.9/#classification
+
+_base_ = [
+ '../_base_/models/mobilenet_v3/mobilenet_v3_small_imagenet.py',
+ '../_base_/datasets/imagenet_bs128_mbv3.py',
+ '../_base_/default_runtime.py',
+]
+
+# schedule settings
+optim_wrapper = dict(
+ optimizer=dict(
+ type='RMSprop',
+ lr=0.064,
+ alpha=0.9,
+ momentum=0.9,
+ eps=0.0316,
+ weight_decay=1e-5))
+
+param_scheduler = dict(type='StepLR', by_epoch=True, step_size=2, gamma=0.973)
+
+train_cfg = dict(by_epoch=True, max_epochs=600, val_interval=1)
+val_cfg = dict()
+test_cfg = dict()
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR
+# based on the actual training batch size.
+# base_batch_size = (8 GPUs) x (128 samples per GPU)
+auto_scale_lr = dict(base_batch_size=1024)
diff --git a/configs/mobilenet_v3/mobilenet-v3-small_8xb16_cifar10.py b/configs/mobilenet_v3/mobilenet-v3-small_8xb16_cifar10.py
new file mode 100644
index 0000000000000000000000000000000000000000..4cfaa2f629523ad66966d3e70c9ca084644e1f8d
--- /dev/null
+++ b/configs/mobilenet_v3/mobilenet-v3-small_8xb16_cifar10.py
@@ -0,0 +1,15 @@
+_base_ = [
+ '../_base_/models/mobilenet_v3/mobilenet_v3_small_cifar.py',
+ '../_base_/datasets/cifar10_bs16.py',
+ '../_base_/schedules/cifar10_bs128.py', '../_base_/default_runtime.py'
+]
+
+# schedule settings
+param_scheduler = dict(
+ type='MultiStepLR',
+ by_epoch=True,
+ milestones=[120, 170],
+ gamma=0.1,
+)
+
+train_cfg = dict(by_epoch=True, max_epochs=200)
diff --git a/configs/mobileone/README.md b/configs/mobileone/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..e753aff9089fe30700f6db4313fd337f73f7d47d
--- /dev/null
+++ b/configs/mobileone/README.md
@@ -0,0 +1,98 @@
+# MobileOne
+
+> [An Improved One millisecond Mobile Backbone](https://arxiv.org/abs/2206.04040)
+
+
+
+## Introduction
+
+MobileOne is proposed by Apple and is based on re-parameterization. On Apple chips, the model reaches close to 76% top-1 accuracy on ImageNet at a latency of under 1 ms. Its main improvements over [RepVGG](../repvgg) are the following (a minimal branch-fusion sketch is given after the list):
+
+- Re-parameterization using depthwise and pointwise convolutions instead of normal convolutions.
+- Removal of the residual structure, which is not friendly to memory access.
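+
+The train-time blocks therefore contain several parallel conv + BN branches that are folded into single convolutions for inference. Below is a minimal, generic sketch of how one conv + BN branch can be folded; the helper name `fuse_conv_bn` and the shapes are illustrative, not the mmpretrain implementation. MobileOne additionally sums the folded kernels of all parallel branches into one depthwise or pointwise convolution at deploy time.
+
+```python
+import torch
+import torch.nn as nn
+
+
+def fuse_conv_bn(conv: nn.Conv2d, bn: nn.BatchNorm2d) -> nn.Conv2d:
+    """Fold a BatchNorm layer into the preceding convolution (inference only)."""
+    fused = nn.Conv2d(
+        conv.in_channels, conv.out_channels, conv.kernel_size,
+        stride=conv.stride, padding=conv.padding, dilation=conv.dilation,
+        groups=conv.groups, bias=True)
+    # BN(x) = gamma * (x - mean) / sqrt(var + eps) + beta
+    scale = bn.weight / torch.sqrt(bn.running_var + bn.eps)
+    fused.weight.data = conv.weight.data * scale.reshape(-1, 1, 1, 1)
+    bias = conv.bias.data if conv.bias is not None else torch.zeros(conv.out_channels)
+    fused.bias.data = bn.bias.data + (bias - bn.running_mean) * scale
+    return fused
+```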
+
+
+

+
+
+## Abstract
+
+
+
+
+
+Efficient neural network backbones for mobile devices are often optimized for metrics such as FLOPs or parameter count. However, these metrics may not correlate well with latency of the network when deployed on a mobile device. Therefore, we perform extensive analysis of different metrics by deploying several mobile-friendly networks on a mobile device. We identify and analyze architectural and optimization bottlenecks in recent efficient neural networks and provide ways to mitigate these bottlenecks. To this end, we design an efficient backbone MobileOne, with variants achieving an inference time under 1 ms on an iPhone12 with 75.9% top-1 accuracy on ImageNet. We show that MobileOne achieves state-of-the-art performance within the efficient architectures while being many times faster on mobile. Our best model obtains similar performance on ImageNet as MobileFormer while being 38x faster. Our model obtains 2.3% better top-1 accuracy on ImageNet than EfficientNet at similar latency. Furthermore, we show that our model generalizes to multiple tasks - image classification, object detection, and semantic segmentation with significant improvements in latency and accuracy as compared to existing efficient architectures when deployed on a mobile device.
+
+
+
+
+## How to use it?
+
+
+
+**Predict image**
+
+```python
+from mmpretrain import inference_model
+
+predict = inference_model('mobileone-s0_8xb32_in1k', 'demo/bird.JPEG')
+print(predict['pred_class'])
+print(predict['pred_score'])
+```
+
+**Use the model**
+
+```python
+import torch
+from mmpretrain import get_model
+
+model = get_model('mobileone-s0_8xb32_in1k', pretrained=True)
+inputs = torch.rand(1, 3, 224, 224)
+out = model(inputs)
+print(type(out))
+# To extract features.
+feats = model.extract_feat(inputs)
+print(type(feats))
+```
+
+**Train/Test Command**
+
+Prepare your dataset according to the [docs](https://mmpretrain.readthedocs.io/en/latest/user_guides/dataset_prepare.html#prepare-dataset).
+
+Train:
+
+```shell
+python tools/train.py configs/mobileone/mobileone-s0_8xb32_in1k.py
+```
+
+Test:
+
+```shell
+python tools/test.py configs/mobileone/mobileone-s0_8xb32_in1k.py https://download.openmmlab.com/mmclassification/v0/mobileone/mobileone-s0_8xb32_in1k_20221110-0bc94952.pth
+```
+
+
+
+## Models and results
+
+### Image Classification on ImageNet-1k
+
+| Model | Pretrain | Params (M) | Flops (G) | Top-1 (%) | Top-5 (%) | Config | Download |
+| :------------------------ | :----------: | :--------: | :-------: | :-------: | :-------: | :----------------------------------: | :----------------------------------------------------------------------------------------: |
+| `mobileone-s0_8xb32_in1k` | From scratch | 2.08 | 0.27 | 71.34 | 89.87 | [config](mobileone-s0_8xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/mobileone/mobileone-s0_8xb32_in1k_20221110-0bc94952.pth) \| [log](https://download.openmmlab.com/mmclassification/v0/mobileone/mobileone-s0_8xb32_in1k_20221110-0bc94952.json) |
+| `mobileone-s1_8xb32_in1k` | From scratch | 4.76 | 0.82 | 75.72 | 92.54 | [config](mobileone-s1_8xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/mobileone/mobileone-s1_8xb32_in1k_20221110-ceeef467.pth) \| [log](https://download.openmmlab.com/mmclassification/v0/mobileone/mobileone-s1_8xb32_in1k_20221110-ceeef467.json) |
+| `mobileone-s2_8xb32_in1k` | From scratch | 7.81 | 1.30 | 77.37 | 93.34 | [config](mobileone-s2_8xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/mobileone/mobileone-s2_8xb32_in1k_20221110-9c7ecb97.pth) \| [log](https://download.openmmlab.com/mmclassification/v0/mobileone/mobileone-s2_8xb32_in1k_20221110-9c7ecb97.json) |
+| `mobileone-s3_8xb32_in1k` | From scratch | 10.08 | 1.89 | 78.06 | 93.83 | [config](mobileone-s3_8xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/mobileone/mobileone-s3_8xb32_in1k_20221110-c95eb3bf.pth) \| [log](https://download.openmmlab.com/mmclassification/v0/mobileone/mobileone-s3_8xb32_in1k_20221110-c95eb3bf.json) |
+| `mobileone-s4_8xb32_in1k` | From scratch | 14.84 | 2.98 | 79.69 | 94.46 | [config](mobileone-s4_8xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/mobileone/mobileone-s4_8xb32_in1k_20221110-28d888cb.pth) \| [log](https://download.openmmlab.com/mmclassification/v0/mobileone/mobileone-s4_8xb32_in1k_20221110-28d888cb.json) |
+
+## Citation
+
+```bibtex
+@article{mobileone2022,
+ title={An Improved One millisecond Mobile Backbone},
+ author={Vasu, Pavan Kumar Anasosalu and Gabriel, James and Zhu, Jeff and Tuzel, Oncel and Ranjan, Anurag},
+ journal={arXiv preprint arXiv:2206.04040},
+ year={2022}
+}
+```
diff --git a/configs/mobileone/deploy/mobileone-s0_deploy_8xb32_in1k.py b/configs/mobileone/deploy/mobileone-s0_deploy_8xb32_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..145f3f4ec90f643a056177a7d7c0b8fc370539cc
--- /dev/null
+++ b/configs/mobileone/deploy/mobileone-s0_deploy_8xb32_in1k.py
@@ -0,0 +1,3 @@
+_base_ = ['../mobileone-s0_8xb32_in1k.py']
+
+model = dict(backbone=dict(deploy=True))
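+
+# `deploy=True` builds the backbone directly in its re-parameterized,
+# single-branch inference form; it therefore expects weights that have already
+# been fused from a training-time (multi-branch) checkpoint.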
diff --git a/configs/mobileone/deploy/mobileone-s1_deploy_8xb32_in1k.py b/configs/mobileone/deploy/mobileone-s1_deploy_8xb32_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..8602c31ce6c7c3115e3f45313b687816f0854ddb
--- /dev/null
+++ b/configs/mobileone/deploy/mobileone-s1_deploy_8xb32_in1k.py
@@ -0,0 +1,3 @@
+_base_ = ['../mobileone-s1_8xb32_in1k.py']
+
+model = dict(backbone=dict(deploy=True))
diff --git a/configs/mobileone/deploy/mobileone-s2_deploy_8xb32_in1k.py b/configs/mobileone/deploy/mobileone-s2_deploy_8xb32_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..97aaddd0740b0a005ecab5b08d3459b0da6c474c
--- /dev/null
+++ b/configs/mobileone/deploy/mobileone-s2_deploy_8xb32_in1k.py
@@ -0,0 +1,3 @@
+_base_ = ['../mobileone-s2_8xb32_in1k.py']
+
+model = dict(backbone=dict(deploy=True))
diff --git a/configs/mobileone/deploy/mobileone-s3_deploy_8xb32_in1k.py b/configs/mobileone/deploy/mobileone-s3_deploy_8xb32_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..0d335a7ba9300f8d6d35a288dab02baf0adabdb2
--- /dev/null
+++ b/configs/mobileone/deploy/mobileone-s3_deploy_8xb32_in1k.py
@@ -0,0 +1,3 @@
+_base_ = ['../mobileone-s3_8xb32_in1k.py']
+
+model = dict(backbone=dict(deploy=True))
diff --git a/configs/mobileone/deploy/mobileone-s4_deploy_8xb32_in1k.py b/configs/mobileone/deploy/mobileone-s4_deploy_8xb32_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..b82f5a9ac7ecd6c5fc84369083c66d6dae0afd51
--- /dev/null
+++ b/configs/mobileone/deploy/mobileone-s4_deploy_8xb32_in1k.py
@@ -0,0 +1,3 @@
+_base_ = ['../mobileone-s4_8xb32_in1k.py']
+
+model = dict(backbone=dict(deploy=True))
diff --git a/configs/mobileone/metafile.yml b/configs/mobileone/metafile.yml
new file mode 100644
index 0000000000000000000000000000000000000000..70370da0e8d56baf8001ddaff1f78110462ad86a
--- /dev/null
+++ b/configs/mobileone/metafile.yml
@@ -0,0 +1,83 @@
+Collections:
+ - Name: MobileOne
+ Metadata:
+ Training Data: ImageNet-1k
+ Architecture:
+ - re-parameterization Convolution
+ - VGG-style Neural Network
+ - Depthwise Convolution
+ - Pointwise Convolution
+ Paper:
+ URL: https://arxiv.org/abs/2206.04040
+ Title: 'An Improved One millisecond Mobile Backbone'
+ README: configs/mobileone/README.md
+ Code:
+ URL: https://github.com/open-mmlab/mmpretrain/blob/v1.0.0rc1/configs/mobileone/metafile.yml
+ Version: v1.0.0rc1
+
+Models:
+ - Name: mobileone-s0_8xb32_in1k
+ In Collection: MobileOne
+ Config: configs/mobileone/mobileone-s0_8xb32_in1k.py
+ Metadata:
+ FLOPs: 274136576 # 0.27G
+ Parameters: 2078504 # 2.08M
+ Results:
+ - Dataset: ImageNet-1k
+ Task: Image Classification
+ Metrics:
+ Top 1 Accuracy: 71.34
+ Top 5 Accuracy: 89.87
+ Weights: https://download.openmmlab.com/mmclassification/v0/mobileone/mobileone-s0_8xb32_in1k_20221110-0bc94952.pth
+ - Name: mobileone-s1_8xb32_in1k
+ In Collection: MobileOne
+ Config: configs/mobileone/mobileone-s1_8xb32_in1k.py
+ Metadata:
+ FLOPs: 823839744 # 8.6G
+ Parameters: 4764840 # 4.82M
+ Results:
+ - Dataset: ImageNet-1k
+ Task: Image Classification
+ Metrics:
+ Top 1 Accuracy: 75.72
+ Top 5 Accuracy: 92.54
+ Weights: https://download.openmmlab.com/mmclassification/v0/mobileone/mobileone-s1_8xb32_in1k_20221110-ceeef467.pth
+ - Name: mobileone-s2_8xb32_in1k
+ In Collection: MobileOne
+ Config: configs/mobileone/mobileone-s2_8xb32_in1k.py
+ Metadata:
+ FLOPs: 1296478848
+ Parameters: 7808168
+ Results:
+ - Dataset: ImageNet-1k
+ Task: Image Classification
+ Metrics:
+ Top 1 Accuracy: 77.37
+ Top 5 Accuracy: 93.34
+ Weights: https://download.openmmlab.com/mmclassification/v0/mobileone/mobileone-s2_8xb32_in1k_20221110-9c7ecb97.pth
+ - Name: mobileone-s3_8xb32_in1k
+ In Collection: MobileOne
+ Config: configs/mobileone/mobileone-s3_8xb32_in1k.py
+ Metadata:
+ FLOPs: 1893842944
+ Parameters: 10078312
+ Results:
+ - Dataset: ImageNet-1k
+ Task: Image Classification
+ Metrics:
+ Top 1 Accuracy: 78.06
+ Top 5 Accuracy: 93.83
+ Weights: https://download.openmmlab.com/mmclassification/v0/mobileone/mobileone-s3_8xb32_in1k_20221110-c95eb3bf.pth
+ - Name: mobileone-s4_8xb32_in1k
+ In Collection: MobileOne
+ Config: configs/mobileone/mobileone-s4_8xb32_in1k.py
+ Metadata:
+ FLOPs: 2979222528
+ Parameters: 14838352
+ Results:
+ - Dataset: ImageNet-1k
+ Task: Image Classification
+ Metrics:
+ Top 1 Accuracy: 79.69
+ Top 5 Accuracy: 94.46
+ Weights: https://download.openmmlab.com/mmclassification/v0/mobileone/mobileone-s4_8xb32_in1k_20221110-28d888cb.pth
diff --git a/configs/mobileone/mobileone-s0_8xb32_in1k.py b/configs/mobileone/mobileone-s0_8xb32_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..be56b86c3ce4afc3cc61995efa60830be98050e0
--- /dev/null
+++ b/configs/mobileone/mobileone-s0_8xb32_in1k.py
@@ -0,0 +1,20 @@
+_base_ = [
+ '../_base_/models/mobileone/mobileone_s0.py',
+ '../_base_/datasets/imagenet_bs32_pil_resize.py',
+ '../_base_/schedules/imagenet_bs256_coslr_coswd_300e.py',
+ '../_base_/default_runtime.py'
+]
+
+# schedule settings
+optim_wrapper = dict(paramwise_cfg=dict(norm_decay_mult=0.))
+
+val_dataloader = dict(batch_size=256)
+test_dataloader = dict(batch_size=256)
+
+custom_hooks = [
+ dict(
+ type='EMAHook',
+ momentum=5e-4,
+ priority='ABOVE_NORMAL',
+ update_buffers=True)
+]
diff --git a/configs/mobileone/mobileone-s1_8xb32_in1k.py b/configs/mobileone/mobileone-s1_8xb32_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..0bc3fb08922e0c87ad681e79c378d2b5404b696f
--- /dev/null
+++ b/configs/mobileone/mobileone-s1_8xb32_in1k.py
@@ -0,0 +1,60 @@
+_base_ = [
+ '../_base_/models/mobileone/mobileone_s1.py',
+ '../_base_/datasets/imagenet_bs32_pil_resize.py',
+ '../_base_/schedules/imagenet_bs256_coslr_coswd_300e.py',
+ '../_base_/default_runtime.py'
+]
+
+# schedule settings
+optim_wrapper = dict(paramwise_cfg=dict(norm_decay_mult=0.))
+
+val_dataloader = dict(batch_size=256)
+test_dataloader = dict(batch_size=256)
+
+bgr_mean = _base_.data_preprocessor['mean'][::-1]
+base_train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='RandomResizedCrop', scale=224, backend='pillow'),
+ dict(type='RandomFlip', prob=0.5, direction='horizontal'),
+ dict(
+ type='RandAugment',
+ policies='timm_increasing',
+ num_policies=2,
+ total_level=10,
+ magnitude_level=7,
+ magnitude_std=0.5,
+ hparams=dict(pad_val=[round(x) for x in bgr_mean])),
+ dict(type='PackInputs')
+]
+
+import copy # noqa: E402
+
+# modify start epoch's RandomResizedCrop.scale to 160
+train_pipeline_1e = copy.deepcopy(base_train_pipeline)
+train_pipeline_1e[1]['scale'] = 160
+train_pipeline_1e[3]['magnitude_level'] *= 0.1
+_base_.train_dataloader.dataset.pipeline = train_pipeline_1e
+
+# modify 37 epoch's RandomResizedCrop.scale to 192
+train_pipeline_37e = copy.deepcopy(base_train_pipeline)
+train_pipeline_37e[1]['scale'] = 192
+train_pipeline_37e[3]['magnitude_level'] *= 0.2
+
+# modify 112 epoch's RandomResizedCrop.scale to 224
+train_pipeline_112e = copy.deepcopy(base_train_pipeline)
+train_pipeline_112e[1]['scale'] = 224
+train_pipeline_112e[3]['magnitude_level'] *= 0.3
+
+custom_hooks = [
+ dict(
+ type='SwitchRecipeHook',
+ schedule=[
+ dict(action_epoch=37, pipeline=train_pipeline_37e),
+ dict(action_epoch=112, pipeline=train_pipeline_112e),
+ ]),
+ dict(
+ type='EMAHook',
+ momentum=5e-4,
+ priority='ABOVE_NORMAL',
+ update_buffers=True)
+]
diff --git a/configs/mobileone/mobileone-s2_8xb32_in1k.py b/configs/mobileone/mobileone-s2_8xb32_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..a7d4aae074952538d5d037b33438172f4c283613
--- /dev/null
+++ b/configs/mobileone/mobileone-s2_8xb32_in1k.py
@@ -0,0 +1,65 @@
+_base_ = [
+ '../_base_/models/mobileone/mobileone_s2.py',
+ '../_base_/datasets/imagenet_bs32_pil_resize.py',
+ '../_base_/schedules/imagenet_bs256_coslr_coswd_300e.py',
+ '../_base_/default_runtime.py'
+]
+
+# schedule settings
+optim_wrapper = dict(paramwise_cfg=dict(norm_decay_mult=0.))
+
+val_dataloader = dict(batch_size=256)
+test_dataloader = dict(batch_size=256)
+
+import copy # noqa: E402
+
+bgr_mean = _base_.data_preprocessor['mean'][::-1]
+base_train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='RandomResizedCrop', scale=224, backend='pillow'),
+ dict(type='RandomFlip', prob=0.5, direction='horizontal'),
+ dict(
+ type='RandAugment',
+ policies='timm_increasing',
+ num_policies=2,
+ total_level=10,
+ magnitude_level=7,
+ magnitude_std=0.5,
+ hparams=dict(pad_val=[round(x) for x in bgr_mean])),
+ dict(type='PackInputs')
+]
+
+# modify start epoch RandomResizedCrop.scale to 160
+# and RA.magnitude_level * 0.3
+train_pipeline_1e = copy.deepcopy(base_train_pipeline)
+train_pipeline_1e[1]['scale'] = 160
+train_pipeline_1e[3]['magnitude_level'] *= 0.3
+_base_.train_dataloader.dataset.pipeline = train_pipeline_1e
+
+# modify 37 epoch's RandomResizedCrop.scale to 192
+# and RA.magnitude_level * 0.7
+train_pipeline_37e = copy.deepcopy(base_train_pipeline)
+train_pipeline_37e[1]['scale'] = 192
+train_pipeline_37e[3]['magnitude_level'] *= 0.7
+
+# modify 112 epoch's RandomResizedCrop.scale to 224
+# and RA.magnitude_level * 1.0
+train_pipeline_112e = copy.deepcopy(base_train_pipeline)
+train_pipeline_112e[1]['scale'] = 224
+train_pipeline_112e[3]['magnitude_level'] *= 1.0
+
+custom_hooks = [
+ dict(
+ type='SwitchRecipeHook',
+ schedule=[
+ dict(action_epoch=37, pipeline=train_pipeline_37e),
+ dict(action_epoch=112, pipeline=train_pipeline_112e),
+ ]),
+ dict(
+ type='EMAHook',
+ momentum=5e-4,
+ priority='ABOVE_NORMAL',
+ update_buffers=True)
+]
diff --git a/configs/mobileone/mobileone-s3_8xb32_in1k.py b/configs/mobileone/mobileone-s3_8xb32_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..2be0dc7e814c4e5a28369ae8888221f3e26ec657
--- /dev/null
+++ b/configs/mobileone/mobileone-s3_8xb32_in1k.py
@@ -0,0 +1,65 @@
+_base_ = [
+ '../_base_/models/mobileone/mobileone_s3.py',
+ '../_base_/datasets/imagenet_bs32_pil_resize.py',
+ '../_base_/schedules/imagenet_bs256_coslr_coswd_300e.py',
+ '../_base_/default_runtime.py'
+]
+
+# schedule settings
+optim_wrapper = dict(paramwise_cfg=dict(norm_decay_mult=0.))
+
+val_dataloader = dict(batch_size=256)
+test_dataloader = dict(batch_size=256)
+
+import copy # noqa: E402
+
+bgr_mean = _base_.data_preprocessor['mean'][::-1]
+base_train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='RandomResizedCrop', scale=224, backend='pillow'),
+ dict(type='RandomFlip', prob=0.5, direction='horizontal'),
+ dict(
+ type='RandAugment',
+ policies='timm_increasing',
+ num_policies=2,
+ total_level=10,
+ magnitude_level=7,
+ magnitude_std=0.5,
+ hparams=dict(pad_val=[round(x) for x in bgr_mean])),
+ dict(type='PackInputs')
+]
+
+# modify start epoch RandomResizedCrop.scale to 160
+# and RA.magnitude_level * 0.3
+train_pipeline_1e = copy.deepcopy(base_train_pipeline)
+train_pipeline_1e[1]['scale'] = 160
+train_pipeline_1e[3]['magnitude_level'] *= 0.3
+_base_.train_dataloader.dataset.pipeline = train_pipeline_1e
+
+# modify 37 epoch's RandomResizedCrop.scale to 192
+# and RA.magnitude_level * 0.7
+train_pipeline_37e = copy.deepcopy(base_train_pipeline)
+train_pipeline_37e[1]['scale'] = 192
+train_pipeline_37e[3]['magnitude_level'] *= 0.7
+
+# modify 112 epoch's RandomResizedCrop.scale to 224
+# and RA.magnitude_level * 1.0
+train_pipeline_112e = copy.deepcopy(base_train_pipeline)
+train_pipeline_112e[1]['scale'] = 224
+train_pipeline_112e[3]['magnitude_level'] *= 1.0
+
+custom_hooks = [
+ dict(
+ type='SwitchRecipeHook',
+ schedule=[
+ dict(action_epoch=37, pipeline=train_pipeline_37e),
+ dict(action_epoch=112, pipeline=train_pipeline_112e),
+ ]),
+ dict(
+ type='EMAHook',
+ momentum=5e-4,
+ priority='ABOVE_NORMAL',
+ update_buffers=True)
+]
diff --git a/configs/mobileone/mobileone-s4_8xb32_in1k.py b/configs/mobileone/mobileone-s4_8xb32_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..49356f05f9574f90192dc32d5b14c3b74a5cd459
--- /dev/null
+++ b/configs/mobileone/mobileone-s4_8xb32_in1k.py
@@ -0,0 +1,63 @@
+_base_ = [
+ '../_base_/models/mobileone/mobileone_s4.py',
+ '../_base_/datasets/imagenet_bs32_pil_resize.py',
+ '../_base_/schedules/imagenet_bs256_coslr_coswd_300e.py',
+ '../_base_/default_runtime.py'
+]
+
+# schedule settings
+optim_wrapper = dict(paramwise_cfg=dict(norm_decay_mult=0.))
+
+val_dataloader = dict(batch_size=256)
+test_dataloader = dict(batch_size=256)
+
+bgr_mean = _base_.data_preprocessor['mean'][::-1]
+base_train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='RandomResizedCrop', scale=224, backend='pillow'),
+ dict(type='RandomFlip', prob=0.5, direction='horizontal'),
+ dict(
+ type='RandAugment',
+ policies='timm_increasing',
+ num_policies=2,
+ total_level=10,
+ magnitude_level=7,
+ magnitude_std=0.5,
+ hparams=dict(pad_val=[round(x) for x in bgr_mean])),
+ dict(type='PackInputs')
+]
+
+import copy # noqa: E402
+
+# modify start epoch RandomResizedCrop.scale to 160
+# and RA.magnitude_level * 0.3
+train_pipeline_1e = copy.deepcopy(base_train_pipeline)
+train_pipeline_1e[1]['scale'] = 160
+train_pipeline_1e[3]['magnitude_level'] *= 0.3
+_base_.train_dataloader.dataset.pipeline = train_pipeline_1e
+
+# modify 37 epoch's RandomResizedCrop.scale to 192
+# and RA.magnitude_level * 0.7
+train_pipeline_37e = copy.deepcopy(base_train_pipeline)
+train_pipeline_37e[1]['scale'] = 192
+train_pipeline_37e[3]['magnitude_level'] *= 0.7
+
+# modify 112 epoch's RandomResizedCrop.scale to 224
+# and RA.magnitude_level * 1.0
+train_pipeline_112e = copy.deepcopy(base_train_pipeline)
+train_pipeline_112e[1]['scale'] = 224
+train_pipeline_112e[3]['magnitude_level'] *= 1.0
+
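+# Progressive training recipe (see SwitchRecipeHook below): 160px crops with
+# reduced RandAugment strength until epoch 37, 192px crops until epoch 112,
+# and 224px crops with full-strength RandAugment afterwards; EMA weights are
+# maintained throughout training.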
+custom_hooks = [
+ dict(
+ type='SwitchRecipeHook',
+ schedule=[
+ dict(action_epoch=37, pipeline=train_pipeline_37e),
+ dict(action_epoch=112, pipeline=train_pipeline_112e),
+ ]),
+ dict(
+ type='EMAHook',
+ momentum=5e-4,
+ priority='ABOVE_NORMAL',
+ update_buffers=True)
+]
diff --git a/configs/mobilevit/README.md b/configs/mobilevit/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..fa0960d123aed6eae6fee1155fd99d0955355280
--- /dev/null
+++ b/configs/mobilevit/README.md
@@ -0,0 +1,96 @@
+# MobileViT
+
+> [MobileViT: Light-weight, General-purpose, and Mobile-friendly Vision Transformer](https://arxiv.org/abs/2110.02178)
+
+
+
+## Introduction
+
+**MobileViT** introduces a light-weight network that combines the advantages of ViTs and CNNs: it uses the `InvertedResidual` blocks from [MobileNetV2](../mobilenet_v2/README.md) together with `MobileViTBlock`s, which adapt [ViT](../vision_transformer/README.md) transformer blocks, to build a standard 5-stage model structure.
+
+The `MobileViTBlock` treats transformers as convolutions to learn a global representation and combines it with ordinary convolution layers that provide a local representation, yielding a block with a global receptive field (a toy sketch of this block is given below). This differs from ViT, which adds an extra class token and position embeddings to learn relative relationships. Since it needs no position embeddings, MobileViT can benefit from multi-scale inputs during training.
+
+The paper also puts forward a multi-scale training strategy that dynamically adjusts the batch size according to the image size, improving both training efficiency and final performance.
+
+MobileViT is also shown to be effective on downstream tasks such as object detection and semantic segmentation.
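+
+To make the "transformers as convolutions" idea concrete, here is a highly simplified, self-contained sketch of a MobileViT-style block. The class name, dimensions, and the use of `nn.TransformerEncoderLayer` are illustrative assumptions, not the mmpretrain implementation: a local convolutional representation is unfolded into patches, pixels at the same within-patch position attend to each other across patches, and the result is folded back and fused with the input.
+
+```python
+import torch
+import torch.nn as nn
+
+
+class TinyMobileViTBlock(nn.Module):
+    """Toy MobileViT-style block: local conv + global transformer + fusion."""
+
+    def __init__(self, channels=64, dim=96, patch=2, num_heads=4):
+        super().__init__()
+        self.patch = patch
+        self.local_rep = nn.Sequential(
+            nn.Conv2d(channels, channels, 3, padding=1, groups=channels),
+            nn.Conv2d(channels, dim, 1))
+        self.global_rep = nn.TransformerEncoderLayer(
+            d_model=dim, nhead=num_heads, dim_feedforward=2 * dim,
+            batch_first=True)
+        self.proj = nn.Conv2d(dim, channels, 1)
+        self.fuse = nn.Conv2d(2 * channels, channels, 3, padding=1)
+
+    def forward(self, x):
+        b, _, h, w = x.shape
+        p = self.patch
+        y = self.local_rep(x)  # local representation, (b, dim, h, w)
+        # unfold: pixels at the same within-patch position, gathered across
+        # all patches, form one attention sequence
+        y = y.reshape(b, -1, h // p, p, w // p, p)
+        y = y.permute(0, 3, 5, 2, 4, 1).reshape(b * p * p, (h // p) * (w // p), -1)
+        y = self.global_rep(y)  # "transformers as convolutions"
+        # fold back into a feature map
+        y = y.reshape(b, p, p, h // p, w // p, -1).permute(0, 5, 3, 1, 4, 2)
+        y = self.proj(y.reshape(b, -1, h, w))
+        return self.fuse(torch.cat([x, y], dim=1))  # fuse global with input
+
+
+block = TinyMobileViTBlock()
+print(block(torch.rand(1, 64, 32, 32)).shape)  # torch.Size([1, 64, 32, 32])
+```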
+
+
+

+
+
+## Abstract
+
+
+
+
+
+
+Light-weight convolutional neural networks (CNNs) are the de-facto for mobile vision tasks. Their spatial inductive biases allow them to learn representations with fewer parameters across different vision tasks. However, these networks are spatially local. To learn global representations, self-attention-based vision transformers (ViTs) have been adopted. Unlike CNNs, ViTs are heavy-weight. In this paper, we ask the following question: is it possible to combine the strengths of CNNs and ViTs to build a light-weight and low latency network for mobile vision tasks? Towards this end, we introduce MobileViT, a light-weight and general-purpose vision transformer for mobile devices. MobileViT presents a different perspective for the global processing of information with transformers, i.e., transformers as convolutions. Our results show that MobileViT significantly outperforms CNN- and ViT-based networks across different tasks and datasets. On the ImageNet-1k dataset, MobileViT achieves top-1 accuracy of 78.4% with about 6 million parameters, which is 3.2% and 6.2% more accurate than MobileNetv3 (CNN-based) and DeIT (ViT-based) for a similar number of parameters. On the MS-COCO object detection task, MobileViT is 5.7% more accurate than MobileNetv3 for a similar number of parameters.
+
+
+
+
+## How to use it?
+
+
+
+**Predict image**
+
+```python
+from mmpretrain import inference_model
+
+predict = inference_model('mobilevit-small_3rdparty_in1k', 'demo/bird.JPEG')
+print(predict['pred_class'])
+print(predict['pred_score'])
+```
+
+**Use the model**
+
+```python
+import torch
+from mmpretrain import get_model
+
+model = get_model('mobilevit-small_3rdparty_in1k', pretrained=True)
+inputs = torch.rand(1, 3, 224, 224)
+out = model(inputs)
+print(type(out))
+# To extract features.
+feats = model.extract_feat(inputs)
+print(type(feats))
+```
+
+**Test Command**
+
+Prepare your dataset according to the [docs](https://mmpretrain.readthedocs.io/en/latest/user_guides/dataset_prepare.html#prepare-dataset).
+
+Test:
+
+```shell
+python tools/test.py configs/mobilevit/mobilevit-small_8xb128_in1k.py https://download.openmmlab.com/mmclassification/v0/mobilevit/mobilevit-small_3rdparty_in1k_20221018-cb4f741c.pth
+```
+
+
+
+## Models and results
+
+### Image Classification on ImageNet-1k
+
+| Model | Pretrain | Params (M) | Flops (G) | Top-1 (%) | Top-5 (%) | Config | Download |
+| :---------------------------------- | :----------: | :--------: | :-------: | :-------: | :-------: | :----------------------------------------: | :------------------------------------------------------------------------: |
+| `mobilevit-small_3rdparty_in1k`\* | From scratch | 5.58 | 2.03 | 78.25 | 94.09 | [config](mobilevit-small_8xb128_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/mobilevit/mobilevit-small_3rdparty_in1k_20221018-cb4f741c.pth) |
+| `mobilevit-xsmall_3rdparty_in1k`\* | From scratch | 2.32 | 1.05 | 74.75 | 92.32 | [config](mobilevit-xsmall_8xb128_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/mobilevit/mobilevit-xsmall_3rdparty_in1k_20221018-be39a6e7.pth) |
+| `mobilevit-xxsmall_3rdparty_in1k`\* | From scratch | 1.27 | 0.42 | 69.02 | 88.91 | [config](mobilevit-xxsmall_8xb128_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/mobilevit/mobilevit-xxsmall_3rdparty_in1k_20221018-77835605.pth) |
+
+*Models with * are converted from the [official repo](https://github.com/apple/ml-cvnets). The config files of these models are only for inference. We haven't reproduced the training results.*
+
+## Citation
+
+```bibtex
+@article{mehta2021mobilevit,
+ title={MobileViT: Light-weight, General-purpose, and Mobile-friendly Vision Transformer},
+ author={Mehta, Sachin and Rastegari, Mohammad},
+ journal={arXiv preprint arXiv:2110.02178},
+ year={2021}
+}
+```
diff --git a/configs/mobilevit/metafile.yml b/configs/mobilevit/metafile.yml
new file mode 100644
index 0000000000000000000000000000000000000000..15fd84ad54cacf0c7c0337b5139ba891d14c22f5
--- /dev/null
+++ b/configs/mobilevit/metafile.yml
@@ -0,0 +1,60 @@
+Collections:
+ - Name: MobileViT
+ Metadata:
+ Training Data: ImageNet-1k
+ Architecture:
+ - MobileViT Block
+ Paper:
+ URL: https://arxiv.org/abs/2110.02178
+ Title: 'MobileViT: Light-weight, General-purpose, and Mobile-friendly Vision Transformer'
+ README: configs/mobilevit/README.md
+
+Models:
+ - Name: mobilevit-small_3rdparty_in1k
+ Metadata:
+ FLOPs: 2030000000
+ Parameters: 5580000
+ In Collection: MobileViT
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 78.25
+ Top 5 Accuracy: 94.09
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/mobilevit/mobilevit-small_3rdparty_in1k_20221018-cb4f741c.pth
+ Config: configs/mobilevit/mobilevit-small_8xb128_in1k.py
+ Converted From:
+ Weights: https://docs-assets.developer.apple.com/ml-research/models/cvnets/classification/mobilevit_s.pt
+ Code: https://github.com/apple/ml-cvnets
+ - Name: mobilevit-xsmall_3rdparty_in1k
+ Metadata:
+ FLOPs: 1050000000
+ Parameters: 2320000
+ In Collection: MobileViT
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 74.75
+ Top 5 Accuracy: 92.32
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/mobilevit/mobilevit-xsmall_3rdparty_in1k_20221018-be39a6e7.pth
+ Config: configs/mobilevit/mobilevit-xsmall_8xb128_in1k.py
+ Converted From:
+ Weights: https://docs-assets.developer.apple.com/ml-research/models/cvnets/classification/mobilevit_xs.pt
+ Code: https://github.com/apple/ml-cvnets
+ - Name: mobilevit-xxsmall_3rdparty_in1k
+ Metadata:
+ FLOPs: 420000000
+ Parameters: 1270000
+ In Collection: MobileViT
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 69.02
+ Top 5 Accuracy: 88.91
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/mobilevit/mobilevit-xxsmall_3rdparty_in1k_20221018-77835605.pth
+ Config: configs/mobilevit/mobilevit-xxsmall_8xb128_in1k.py
+ Converted From:
+ Weights: https://docs-assets.developer.apple.com/ml-research/models/cvnets/classification/mobilevit_xxs.pt
+ Code: https://github.com/apple/ml-cvnets
diff --git a/configs/mobilevit/mobilevit-small_8xb128_in1k.py b/configs/mobilevit/mobilevit-small_8xb128_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..596939631c0520e67d480a37669704556719f2dc
--- /dev/null
+++ b/configs/mobilevit/mobilevit-small_8xb128_in1k.py
@@ -0,0 +1,30 @@
+_base_ = [
+ '../_base_/models/mobilevit/mobilevit_s.py',
+ '../_base_/datasets/imagenet_bs32.py',
+ '../_base_/default_runtime.py',
+ '../_base_/schedules/imagenet_bs256.py',
+]
+
+# no mean/std normalization, to match the original implementation
+data_preprocessor = dict(
+ # RGB format normalization parameters
+ mean=[0, 0, 0],
+ std=[255, 255, 255],
+ # use bgr directly
+ to_rgb=False,
+)
+
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='ResizeEdge', scale=288, edge='short'),
+ dict(type='CenterCrop', crop_size=256),
+ dict(type='PackInputs'),
+]
+
+train_dataloader = dict(batch_size=128)
+
+val_dataloader = dict(
+ batch_size=128,
+ dataset=dict(pipeline=test_pipeline),
+)
+test_dataloader = val_dataloader
diff --git a/configs/mobilevit/mobilevit-xsmall_8xb128_in1k.py b/configs/mobilevit/mobilevit-xsmall_8xb128_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..557892bcc4911912d7e5d585cb0d27235cf08cd5
--- /dev/null
+++ b/configs/mobilevit/mobilevit-xsmall_8xb128_in1k.py
@@ -0,0 +1,30 @@
+_base_ = [
+ '../_base_/models/mobilevit/mobilevit_xs.py',
+ '../_base_/datasets/imagenet_bs32.py',
+ '../_base_/default_runtime.py',
+ '../_base_/schedules/imagenet_bs256.py',
+]
+
+# no mean/std normalization, to match the original implementation
+data_preprocessor = dict(
+ # RGB format normalization parameters
+ mean=[0, 0, 0],
+ std=[255, 255, 255],
+ # use bgr directly
+ to_rgb=False,
+)
+
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='ResizeEdge', scale=288, edge='short'),
+ dict(type='CenterCrop', crop_size=256),
+ dict(type='PackInputs'),
+]
+
+train_dataloader = dict(batch_size=128)
+
+val_dataloader = dict(
+ batch_size=128,
+ dataset=dict(pipeline=test_pipeline),
+)
+test_dataloader = val_dataloader
diff --git a/configs/mobilevit/mobilevit-xxsmall_8xb128_in1k.py b/configs/mobilevit/mobilevit-xxsmall_8xb128_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..74aea82f32bd65fd71962c588384e4a1e6ab43ea
--- /dev/null
+++ b/configs/mobilevit/mobilevit-xxsmall_8xb128_in1k.py
@@ -0,0 +1,30 @@
+_base_ = [
+ '../_base_/models/mobilevit/mobilevit_xxs.py',
+ '../_base_/datasets/imagenet_bs32.py',
+ '../_base_/default_runtime.py',
+ '../_base_/schedules/imagenet_bs256.py',
+]
+
+# no mean/std normalization, to match the original implementation
+data_preprocessor = dict(
+ # RGB format normalization parameters
+ mean=[0, 0, 0],
+ std=[255, 255, 255],
+ # use bgr directly
+ to_rgb=False,
+)
+
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='ResizeEdge', scale=288, edge='short'),
+ dict(type='CenterCrop', crop_size=256),
+ dict(type='PackInputs'),
+]
+
+train_dataloader = dict(batch_size=128)
+
+val_dataloader = dict(
+ batch_size=128,
+ dataset=dict(pipeline=test_pipeline),
+)
+test_dataloader = val_dataloader
diff --git a/configs/mocov2/README.md b/configs/mocov2/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..cb0ae4ee7468f3294b28157eafb32cb04b63814d
--- /dev/null
+++ b/configs/mocov2/README.md
@@ -0,0 +1,85 @@
+# MoCoV2
+
+> [Improved Baselines with Momentum Contrastive Learning](https://arxiv.org/abs/2003.04297)
+
+
+
+## Abstract
+
+Contrastive unsupervised learning has recently shown encouraging progress, e.g., in Momentum Contrast (MoCo) and SimCLR. In this note, we verify the effectiveness of two of SimCLR’s design improvements by implementing them in the MoCo framework. With simple modifications to MoCo—namely, using an MLP projection head and more data augmentation—we establish stronger baselines that outperform SimCLR and do not require large training batches. We hope this will make state-of-the-art unsupervised learning research more accessible.
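+
+Both improvements plug into the MoCo framework, whose training objective is the InfoNCE contrastive loss over a queue of negative keys (cf. the `ContrastiveHead` with `temperature=0.2` in `mocov2_resnet50_8xb32-coslr-200e_in1k.py`). A minimal, illustrative sketch of that loss, with hypothetical tensor shapes:
+
+```python
+import torch
+import torch.nn.functional as F
+
+
+def moco_infonce_loss(q, k, queue, temperature=0.2):
+    """InfoNCE: one positive key vs. a queue of negatives (illustrative shapes)."""
+    q = F.normalize(q, dim=1)                 # queries, (N, C)
+    k = F.normalize(k, dim=1)                 # keys from the momentum encoder, (N, C)
+    l_pos = (q * k).sum(dim=1, keepdim=True)  # positive logits, (N, 1)
+    l_neg = q @ queue                         # negative logits, (N, K); queue of past keys, (C, K)
+    logits = torch.cat([l_pos, l_neg], dim=1) / temperature
+    labels = torch.zeros(q.size(0), dtype=torch.long)  # the positive is index 0
+    return F.cross_entropy(logits, labels)
+```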
+
+
+

+
+
+## How to use it?
+
+
+
+**Predict image**
+
+```python
+from mmpretrain import inference_model
+
+predict = inference_model('resnet50_mocov2-pre_8xb32-linear-steplr-100e_in1k', 'demo/bird.JPEG')
+print(predict['pred_class'])
+print(predict['pred_score'])
+```
+
+**Use the model**
+
+```python
+import torch
+from mmpretrain import get_model
+
+model = get_model('mocov2_resnet50_8xb32-coslr-200e_in1k', pretrained=True)
+inputs = torch.rand(1, 3, 224, 224)
+out = model(inputs)
+print(type(out))
+# To extract features.
+feats = model.extract_feat(inputs)
+print(type(feats))
+```
+
+**Train/Test Command**
+
+Prepare your dataset according to the [docs](https://mmpretrain.readthedocs.io/en/latest/user_guides/dataset_prepare.html#prepare-dataset).
+
+Train:
+
+```shell
+python tools/train.py configs/mocov2/mocov2_resnet50_8xb32-coslr-200e_in1k.py
+```
+
+Test:
+
+```shell
+python tools/test.py configs/mocov2/benchmarks/resnet50_8xb32-linear-steplr-100e_in1k.py https://download.openmmlab.com/mmselfsup/1.x/mocov2/mocov2_resnet50_8xb32-coslr-200e_in1k/resnet50_linear-8xb32-steplr-100e_in1k/resnet50_linear-8xb32-steplr-100e_in1k_20220825-994c4128.pth
+```
+
+
+
+## Models and results
+
+### Pretrained models
+
+| Model | Params (M) | Flops (G) | Config | Download |
+| :-------------------------------------- | :--------: | :-------: | :------------------------------------------------: | :------------------------------------------------------------------------------------------: |
+| `mocov2_resnet50_8xb32-coslr-200e_in1k` | 55.93 | 4.11 | [config](mocov2_resnet50_8xb32-coslr-200e_in1k.py) | [model](https://download.openmmlab.com/mmselfsup/1.x/mocov2/mocov2_resnet50_8xb32-coslr-200e_in1k/mocov2_resnet50_8xb32-coslr-200e_in1k_20220825-b6d23c86.pth) \| [log](https://download.openmmlab.com/mmselfsup/1.x/mocov2/mocov2_resnet50_8xb32-coslr-200e_in1k/mocov2_resnet50_8xb32-coslr-200e_in1k_20220825-b6d23c86.json) |
+
+### Image Classification on ImageNet-1k
+
+| Model | Pretrain | Params (M) | Flops (G) | Top-1 (%) | Config | Download |
+| :---------------------------------------- | :------------------------------------------: | :--------: | :-------: | :-------: | :----------------------------------------: | :-------------------------------------------: |
+| `resnet50_mocov2-pre_8xb32-linear-steplr-100e_in1k` | [MOCOV2](https://download.openmmlab.com/mmselfsup/1.x/mocov2/mocov2_resnet50_8xb32-coslr-200e_in1k/mocov2_resnet50_8xb32-coslr-200e_in1k_20220825-b6d23c86.pth) | 25.56 | 4.11 | 67.50 | [config](benchmarks/resnet50_8xb32-linear-steplr-100e_in1k.py) | [model](https://download.openmmlab.com/mmselfsup/1.x/mocov2/mocov2_resnet50_8xb32-coslr-200e_in1k/resnet50_linear-8xb32-steplr-100e_in1k/resnet50_linear-8xb32-steplr-100e_in1k_20220825-994c4128.pth) \| [log](https://download.openmmlab.com/mmselfsup/1.x/mocov2/mocov2_resnet50_8xb32-coslr-200e_in1k/resnet50_linear-8xb32-steplr-100e_in1k/resnet50_linear-8xb32-steplr-100e_in1k_20220825-994c4128.json) |
+
+## Citation
+
+```bibtex
+@article{chen2020improved,
+ title={Improved baselines with momentum contrastive learning},
+ author={Chen, Xinlei and Fan, Haoqi and Girshick, Ross and He, Kaiming},
+ journal={arXiv preprint arXiv:2003.04297},
+ year={2020}
+}
+```
diff --git a/configs/mocov2/benchmarks/resnet50_8xb32-linear-steplr-100e_in1k.py b/configs/mocov2/benchmarks/resnet50_8xb32-linear-steplr-100e_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..37795d9c866c5f9b26b0e016959a01677b8a216e
--- /dev/null
+++ b/configs/mocov2/benchmarks/resnet50_8xb32-linear-steplr-100e_in1k.py
@@ -0,0 +1,20 @@
+_base_ = [
+ '../../_base_/models/resnet50.py',
+ '../../_base_/datasets/imagenet_bs32_pil_resize.py',
+ '../../_base_/schedules/imagenet_sgd_steplr_100e.py',
+ '../../_base_/default_runtime.py',
+]
+
+model = dict(
+ backbone=dict(
+ frozen_stages=4,
+ init_cfg=dict(type='Pretrained', checkpoint='', prefix='backbone.')))
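+
+# Linear evaluation protocol: the ResNet-50 backbone is frozen (frozen_stages=4)
+# and loaded from a self-supervised checkpoint (fill in `checkpoint` above);
+# only the linear head is trained, hence the unusually large LR of 30 below.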
+
+# optimizer
+optim_wrapper = dict(
+ type='OptimWrapper',
+ optimizer=dict(type='SGD', lr=30., momentum=0.9, weight_decay=0.))
+
+# runtime settings
+default_hooks = dict(
+ checkpoint=dict(type='CheckpointHook', interval=10, max_keep_ckpts=3))
diff --git a/configs/mocov2/metafile.yml b/configs/mocov2/metafile.yml
new file mode 100644
index 0000000000000000000000000000000000000000..4440db45b5a1a6ab8352c589471cbd4b6d6bb786
--- /dev/null
+++ b/configs/mocov2/metafile.yml
@@ -0,0 +1,45 @@
+Collections:
+ - Name: MoCoV2
+ Metadata:
+ Training Data: ImageNet-1k
+ Training Techniques:
+ - SGD with Momentum
+ - Weight Decay
+ Training Resources: 8x V100 GPUs
+ Architecture:
+ - ResNet
+ - MoCo
+ Paper:
+ Title: Improved Baselines with Momentum Contrastive Learning
+ URL: https://arxiv.org/abs/2003.04297
+ README: configs/mocov2/README.md
+
+Models:
+ - Name: mocov2_resnet50_8xb32-coslr-200e_in1k
+ Metadata:
+ Epochs: 200
+ Batch Size: 256
+ FLOPs: 4109364224
+ Parameters: 55933312
+ Training Data: ImageNet-1k
+ In Collection: MoCoV2
+ Results: null
+ Weights: https://download.openmmlab.com/mmselfsup/1.x/mocov2/mocov2_resnet50_8xb32-coslr-200e_in1k/mocov2_resnet50_8xb32-coslr-200e_in1k_20220825-b6d23c86.pth
+ Config: configs/mocov2/mocov2_resnet50_8xb32-coslr-200e_in1k.py
+ Downstream:
+ - resnet50_mocov2-pre_8xb32-linear-steplr-100e_in1k
+ - Name: resnet50_mocov2-pre_8xb32-linear-steplr-100e_in1k
+ Metadata:
+ Epochs: 100
+ Batch Size: 256
+ FLOPs: 4109464576
+ Parameters: 25557032
+ Training Data: ImageNet-1k
+ In Collection: MoCoV2
+ Results:
+ - Task: Image Classification
+ Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 67.5
+ Weights: https://download.openmmlab.com/mmselfsup/1.x/mocov2/mocov2_resnet50_8xb32-coslr-200e_in1k/resnet50_linear-8xb32-steplr-100e_in1k/resnet50_linear-8xb32-steplr-100e_in1k_20220825-994c4128.pth
+ Config: configs/mocov2/benchmarks/resnet50_8xb32-linear-steplr-100e_in1k.py
diff --git a/configs/mocov2/mocov2_resnet50_8xb32-coslr-200e_in1k.py b/configs/mocov2/mocov2_resnet50_8xb32-coslr-200e_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..8037d075a2e5a8490dc4c3709f274784a6f3f4f0
--- /dev/null
+++ b/configs/mocov2/mocov2_resnet50_8xb32-coslr-200e_in1k.py
@@ -0,0 +1,34 @@
+_base_ = [
+ '../_base_/datasets/imagenet_bs32_mocov2.py',
+ '../_base_/schedules/imagenet_sgd_coslr_200e.py',
+ '../_base_/default_runtime.py',
+]
+
+# model settings
+model = dict(
+ type='MoCo',
+ queue_len=65536,
+ feat_dim=128,
+ momentum=0.001,
+ backbone=dict(
+ type='ResNet',
+ depth=50,
+ norm_cfg=dict(type='BN'),
+ zero_init_residual=False),
+ neck=dict(
+ type='MoCoV2Neck',
+ in_channels=2048,
+ hid_channels=2048,
+ out_channels=128,
+ with_avg_pool=True),
+ head=dict(
+ type='ContrastiveHead',
+ loss=dict(type='CrossEntropyLoss'),
+ temperature=0.2))
+
+# only keeps the latest 3 checkpoints
+default_hooks = dict(checkpoint=dict(max_keep_ckpts=3))
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR
+# based on the actual training batch size.
+auto_scale_lr = dict(base_batch_size=256)
diff --git a/configs/mocov3/README.md b/configs/mocov3/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..a9477e8a6da037a4e773bcb693b0f449f8e8fda7
--- /dev/null
+++ b/configs/mocov3/README.md
@@ -0,0 +1,96 @@
+# MoCoV3
+
+> [An Empirical Study of Training Self-Supervised Vision Transformers](https://arxiv.org/abs/2104.02057)
+
+
+
+## Abstract
+
+This paper does not describe a novel method. Instead, it studies a straightforward, incremental, yet must-know baseline given the recent progress in computer vision: self-supervised learning for Vision Transformers (ViT). While the training recipes for standard convolutional networks have been highly mature and robust, the recipes for ViT are yet to be built, especially in the self-supervised scenarios where training becomes more challenging. In this work, we go back to basics and investigate the effects of several fundamental components for training self-supervised ViT. We observe that instability is a major issue that degrades accuracy, and it can be hidden by apparently good results. We reveal that these results are indeed partial failure, and they can be improved when training is made more stable. We benchmark ViT results in MoCo v3 and several other self-supervised frameworks, with ablations in various aspects. We discuss the currently positive evidence as well as challenges and open questions. We hope that this work will provide useful data points and experience for future research.
+
+
+

+
+
+## How to use it?
+
+
+
+**Predict image**
+
+```python
+from mmpretrain import inference_model
+
+predict = inference_model('resnet50_mocov3-100e-pre_8xb128-linear-coslr-90e_in1k', 'demo/bird.JPEG')
+print(predict['pred_class'])
+print(predict['pred_score'])
+```
+
+**Use the model**
+
+```python
+import torch
+from mmpretrain import get_model
+
+model = get_model('mocov3_resnet50_8xb512-amp-coslr-100e_in1k', pretrained=True)
+inputs = torch.rand(1, 3, 224, 224)
+out = model(inputs)
+print(type(out))
+# To extract features.
+feats = model.extract_feat(inputs)
+print(type(feats))
+```
+
+**Train/Test Command**
+
+Prepare your dataset according to the [docs](https://mmpretrain.readthedocs.io/en/latest/user_guides/dataset_prepare.html#prepare-dataset).
+
+Train:
+
+```shell
+python tools/train.py configs/mocov3/mocov3_resnet50_8xb512-amp-coslr-100e_in1k.py
+```
+
+Test:
+
+```shell
+python tools/test.py configs/mocov3/benchmarks/resnet50_8xb128-linear-coslr-90e_in1k.py https://download.openmmlab.com/mmselfsup/1.x/mocov3/mocov3_resnet50_8xb512-amp-coslr-100e_in1k/resnet50_linear-8xb128-coslr-90e_in1k/resnet50_linear-8xb128-coslr-90e_in1k_20220927-8f7d937e.pth
+```
+
+
+
+## Models and results
+
+### Pretrained models
+
+| Model | Params (M) | Flops (G) | Config | Download |
+| :------------------------------------------------- | :--------: | :-------: | :-----------------------------------------------------------: | :--------------------------------------------------------------------: |
+| `mocov3_resnet50_8xb512-amp-coslr-100e_in1k` | 68.01 | 4.11 | [config](mocov3_resnet50_8xb512-amp-coslr-100e_in1k.py) | [model](https://download.openmmlab.com/mmselfsup/1.x/mocov3/mocov3_resnet50_8xb512-amp-coslr-100e_in1k/mocov3_resnet50_8xb512-amp-coslr-100e_in1k_20220927-f1144efa.pth) \| [log](https://download.openmmlab.com/mmselfsup/1.x/mocov3/mocov3_resnet50_8xb512-amp-coslr-100e_in1k/mocov3_resnet50_8xb512-amp-coslr-100e_in1k_20220927-f1144efa.json) |
+| `mocov3_resnet50_8xb512-amp-coslr-300e_in1k` | 68.01 | 4.11 | [config](mocov3_resnet50_8xb512-amp-coslr-300e_in1k.py) | [model](https://download.openmmlab.com/mmselfsup/1.x/mocov3/mocov3_resnet50_8xb512-amp-coslr-300e_in1k/mocov3_resnet50_8xb512-amp-coslr-300e_in1k_20220927-1e4f3304.pth) \| [log](https://download.openmmlab.com/mmselfsup/1.x/mocov3/mocov3_resnet50_8xb512-amp-coslr-300e_in1k/mocov3_resnet50_8xb512-amp-coslr-300e_in1k_20220927-1e4f3304.json) |
+| `mocov3_resnet50_8xb512-amp-coslr-800e_in1k` | 68.01 | 4.11 | [config](mocov3_resnet50_8xb512-amp-coslr-800e_in1k.py) | [model](https://download.openmmlab.com/mmselfsup/1.x/mocov3/mocov3_resnet50_8xb512-amp-coslr-800e_in1k/mocov3_resnet50_8xb512-amp-coslr-800e_in1k_20220927-e043f51a.pth) \| [log](https://download.openmmlab.com/mmselfsup/1.x/mocov3/mocov3_resnet50_8xb512-amp-coslr-800e_in1k/mocov3_resnet50_8xb512-amp-coslr-800e_in1k_20220927-e043f51a.json) |
+| `mocov3_vit-small-p16_16xb256-amp-coslr-300e_in1k` | 84.27 | 4.61 | [config](mocov3_vit-small-p16_16xb256-amp-coslr-300e_in1k.py) | [model](https://download.openmmlab.com/mmselfsup/1.x/mocov3/mocov3_vit-small-p16_16xb256-amp-coslr-300e_in1k/mocov3_vit-small-p16_16xb256-amp-coslr-300e_in1k-224_20220826-08bc52f7.pth) \| [log](https://download.openmmlab.com/mmselfsup/1.x/mocov3/mocov3_vit-small-p16_16xb256-amp-coslr-300e_in1k/mocov3_vit-small-p16_16xb256-amp-coslr-300e_in1k-224_20220826-08bc52f7.json) |
+| `mocov3_vit-base-p16_16xb256-amp-coslr-300e_in1k` | 215.68 | 17.58 | [config](mocov3_vit-base-p16_16xb256-amp-coslr-300e_in1k.py) | [model](https://download.openmmlab.com/mmselfsup/1.x/mocov3/mocov3_vit-base-p16_16xb256-amp-coslr-300e_in1k/mocov3_vit-base-p16_16xb256-amp-coslr-300e_in1k-224_20220826-25213343.pth) \| [log](https://download.openmmlab.com/mmselfsup/1.x/mocov3/mocov3_vit-base-p16_16xb256-amp-coslr-300e_in1k/mocov3_vit-base-p16_16xb256-amp-coslr-300e_in1k-224_20220826-25213343.json) |
+| `mocov3_vit-large-p16_64xb64-amp-coslr-300e_in1k` | 652.78 | 61.60 | [config](mocov3_vit-large-p16_64xb64-amp-coslr-300e_in1k.py) | [model](https://download.openmmlab.com/mmselfsup/1.x/mocov3/mocov3_vit-large-p16_64xb64-amp-coslr-300e_in1k/mocov3_vit-large-p16_64xb64-amp-coslr-300e_in1k-224_20220829-9b88a442.pth) \| [log](https://download.openmmlab.com/mmselfsup/1.x/mocov3/mocov3_vit-large-p16_64xb64-amp-coslr-300e_in1k/mocov3_vit-large-p16_64xb64-amp-coslr-300e_in1k-224_20220829-9b88a442.json) |
+
+### Image Classification on ImageNet-1k
+
+| Model | Pretrain | Params (M) | Flops (G) | Top-1 (%) | Config | Download |
+| :---------------------------------------- | :------------------------------------------: | :--------: | :-------: | :-------: | :----------------------------------------: | :-------------------------------------------: |
+| `resnet50_mocov3-100e-pre_8xb128-linear-coslr-90e_in1k` | [MOCOV3 100-Epochs](https://download.openmmlab.com/mmselfsup/1.x/mocov3/mocov3_resnet50_8xb512-amp-coslr-100e_in1k/mocov3_resnet50_8xb512-amp-coslr-100e_in1k_20220927-f1144efa.pth) | 25.56 | 4.11 | 69.60 | [config](benchmarks/resnet50_8xb128-linear-coslr-90e_in1k.py) | [model](https://download.openmmlab.com/mmselfsup/1.x/mocov3/mocov3_resnet50_8xb512-amp-coslr-100e_in1k/resnet50_linear-8xb128-coslr-90e_in1k/resnet50_linear-8xb128-coslr-90e_in1k_20220927-8f7d937e.pth) \| [log](https://download.openmmlab.com/mmselfsup/1.x/mocov3/mocov3_resnet50_8xb512-amp-coslr-100e_in1k/resnet50_linear-8xb128-coslr-90e_in1k/resnet50_linear-8xb128-coslr-90e_in1k_20220927-8f7d937e.json) |
+| `resnet50_mocov3-300e-pre_8xb128-linear-coslr-90e_in1k` | [MOCOV3 300-Epochs](https://download.openmmlab.com/mmselfsup/1.x/mocov3/mocov3_resnet50_8xb512-amp-coslr-300e_in1k/mocov3_resnet50_8xb512-amp-coslr-300e_in1k_20220927-1e4f3304.pth) | 25.56 | 4.11 | 72.80 | [config](benchmarks/resnet50_8xb128-linear-coslr-90e_in1k.py) | [model](https://download.openmmlab.com/mmselfsup/1.x/mocov3/mocov3_resnet50_8xb512-amp-coslr-300e_in1k/resnet50_linear-8xb128-coslr-90e_in1k/resnet50_linear-8xb128-coslr-90e_in1k_20220927-d21ddac2.pth) \| [log](https://download.openmmlab.com/mmselfsup/1.x/mocov3/mocov3_resnet50_8xb512-amp-coslr-300e_in1k/resnet50_linear-8xb128-coslr-90e_in1k/resnet50_linear-8xb128-coslr-90e_in1k_20220927-d21ddac2.json) |
+| `resnet50_mocov3-800e-pre_8xb128-linear-coslr-90e_in1k` | [MOCOV3 800-Epochs](https://download.openmmlab.com/mmselfsup/1.x/mocov3/mocov3_resnet50_8xb512-amp-coslr-800e_in1k/mocov3_resnet50_8xb512-amp-coslr-800e_in1k_20220927-e043f51a.pth) | 25.56 | 4.11 | 74.40 | [config](benchmarks/resnet50_8xb128-linear-coslr-90e_in1k.py) | [model](https://download.openmmlab.com/mmselfsup/1.x/mocov3/mocov3_resnet50_8xb512-amp-coslr-800e_in1k/resnet50_linear-8xb128-coslr-90e_in1k/resnet50_linear-8xb128-coslr-90e_in1k_20220927-0e97a483.pth) \| [log](https://download.openmmlab.com/mmselfsup/1.x/mocov3/mocov3_resnet50_8xb512-amp-coslr-800e_in1k/resnet50_linear-8xb128-coslr-90e_in1k/resnet50_linear-8xb128-coslr-90e_in1k_20220927-0e97a483.json) |
+| `vit-small-p16_mocov3-pre_8xb128-linear-coslr-90e_in1k` | [MOCOV3](https://download.openmmlab.com/mmselfsup/1.x/mocov3/mocov3_vit-small-p16_16xb256-amp-coslr-300e_in1k/mocov3_vit-small-p16_16xb256-amp-coslr-300e_in1k-224_20220826-08bc52f7.pth) | 22.05 | 4.61 | 73.60 | [config](benchmarks/vit-small-p16_8xb128-linear-coslr-90e_in1k.py) | [model](https://download.openmmlab.com/mmselfsup/1.x/mocov3/mocov3_vit-small-p16_16xb256-amp-coslr-300e_in1k/vit-small-p16_linear-8xb128-coslr-90e_in1k/vit-small-p16_linear-8xb128-coslr-90e_in1k_20220826-376674ef.pth) \| [log](https://download.openmmlab.com/mmselfsup/1.x/mocov3/mocov3_vit-small-p16_16xb256-amp-coslr-300e_in1k/vit-small-p16_linear-8xb128-coslr-90e_in1k/vit-small-p16_linear-8xb128-coslr-90e_in1k_20220826-376674ef.json) |
+| `vit-base-p16_mocov3-pre_8xb64-coslr-150e_in1k` | [MOCOV3](https://download.openmmlab.com/mmselfsup/1.x/mocov3/mocov3_vit-base-p16_16xb256-amp-coslr-300e_in1k/mocov3_vit-base-p16_16xb256-amp-coslr-300e_in1k-224_20220826-25213343.pth) | 86.57 | 17.58 | 83.00 | [config](benchmarks/vit-base-p16_8xb64-coslr-150e_in1k.py) | [model](https://download.openmmlab.com/mmselfsup/1.x/mocov3/mocov3_vit-base-p16_16xb256-amp-coslr-300e_in1k/vit-base-p16_ft-8xb64-coslr-150e_in1k/vit-base-p16_ft-8xb64-coslr-150e_in1k_20220826-f1e6c442.pth) \| [log](https://download.openmmlab.com/mmselfsup/1.x/mocov3/mocov3_vit-base-p16_16xb256-amp-coslr-300e_in1k/vit-base-p16_ft-8xb64-coslr-150e_in1k/vit-base-p16_ft-8xb64-coslr-150e_in1k_20220826-f1e6c442.json) |
+| `vit-base-p16_mocov3-pre_8xb128-linear-coslr-90e_in1k` | [MOCOV3](https://download.openmmlab.com/mmselfsup/1.x/mocov3/mocov3_vit-base-p16_16xb256-amp-coslr-300e_in1k/mocov3_vit-base-p16_16xb256-amp-coslr-300e_in1k-224_20220826-25213343.pth) | 86.57 | 17.58 | 76.90 | [config](benchmarks/vit-base-p16_8xb128-linear-coslr-90e_in1k.py) | [model](https://download.openmmlab.com/mmselfsup/1.x/mocov3/mocov3_vit-base-p16_16xb256-amp-coslr-300e_in1k/vit-base-p16_linear-8xb128-coslr-90e_in1k/vit-base-p16_linear-8xb128-coslr-90e_in1k_20220826-83be7758.pth) \| [log](https://download.openmmlab.com/mmselfsup/1.x/mocov3/mocov3_vit-base-p16_16xb256-amp-coslr-300e_in1k/vit-base-p16_linear-8xb128-coslr-90e_in1k/vit-base-p16_linear-8xb128-coslr-90e_in1k_20220826-83be7758.json) |
+| `vit-large-p16_mocov3-pre_8xb64-coslr-100e_in1k` | [MOCOV3](https://download.openmmlab.com/mmselfsup/1.x/mocov3/mocov3_vit-large-p16_64xb64-amp-coslr-300e_in1k/mocov3_vit-large-p16_64xb64-amp-coslr-300e_in1k-224_20220829-9b88a442.pth) | 304.33 | 61.60 | 83.70 | [config](benchmarks/vit-large-p16_8xb64-coslr-100e_in1k.py) | [model](https://download.openmmlab.com/mmselfsup/1.x/mocov3/mocov3_vit-large-p16_64xb64-amp-coslr-300e_in1k/vit-large-p16_ft-8xb64-coslr-100e_in1k/vit-large-p16_ft-8xb64-coslr-100e_in1k_20220829-878a2f7f.pth) \| [log](https://download.openmmlab.com/mmselfsup/1.x/mocov3/mocov3_vit-large-p16_64xb64-amp-coslr-300e_in1k/vit-large-p16_ft-8xb64-coslr-100e_in1k/vit-large-p16_ft-8xb64-coslr-100e_in1k_20220829-878a2f7f.json) |
+
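+The benchmark configs above leave the `checkpoint` field of the backbone's `init_cfg` empty. The snippet below is a minimal sketch of filling it in programmatically with MMEngine's `Config`; the checkpoint URL is taken from the pretrained-model table above, and `merge_from_dict` is only one option, since the same dotted key can also be passed to `tools/train.py` through `--cfg-options`.
+
+```python
+from mmengine.config import Config
+
+# Load the ResNet-50 linear probing benchmark config.
+cfg = Config.fromfile(
+    'configs/mocov3/benchmarks/resnet50_8xb128-linear-coslr-90e_in1k.py')
+
+# Point the backbone initialization at the 100-epoch MoCoV3 checkpoint.
+cfg.merge_from_dict({
+    'model.backbone.init_cfg.checkpoint':
+        'https://download.openmmlab.com/mmselfsup/1.x/mocov3/'
+        'mocov3_resnet50_8xb512-amp-coslr-100e_in1k/'
+        'mocov3_resnet50_8xb512-amp-coslr-100e_in1k_20220927-f1144efa.pth',
+})
+
+print(cfg.model.backbone.init_cfg.checkpoint)
+```
+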
+## Citation
+
+```bibtex
+@InProceedings{Chen_2021_ICCV,
+ title = {An Empirical Study of Training Self-Supervised Vision Transformers},
+ author = {Chen, Xinlei and Xie, Saining and He, Kaiming},
+ booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
+ year = {2021}
+}
+```
diff --git a/configs/mocov3/benchmarks/resnet50_8xb128-linear-coslr-90e_in1k.py b/configs/mocov3/benchmarks/resnet50_8xb128-linear-coslr-90e_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..4d0b202b0f643c51e5d931cbf1ee59793aae03cb
--- /dev/null
+++ b/configs/mocov3/benchmarks/resnet50_8xb128-linear-coslr-90e_in1k.py
@@ -0,0 +1,31 @@
+_base_ = [
+ '../../_base_/models/resnet50.py',
+ '../../_base_/datasets/imagenet_bs32_pil_resize.py',
+ '../../_base_/schedules/imagenet_sgd_coslr_100e.py',
+ '../../_base_/default_runtime.py',
+]
+
+# dataset settings
+train_dataloader = dict(batch_size=128)
+
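+# The backbone is kept frozen for linear probing and is initialized from a
+# MoCoV3 pretrained checkpoint. The `checkpoint` field below is left empty and
+# is expected to be filled in when launching the benchmark, e.g. via
+# `--cfg-options model.backbone.init_cfg.checkpoint=...`.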
+model = dict(
+ backbone=dict(
+ frozen_stages=4,
+ norm_eval=True,
+ init_cfg=dict(type='Pretrained', checkpoint='', prefix='backbone.')))
+
+# optimizer
+optim_wrapper = dict(
+ type='OptimWrapper',
+ optimizer=dict(type='SGD', lr=0.4, momentum=0.9, weight_decay=0.))
+
+# learning rate scheduler
+param_scheduler = [
+ dict(type='CosineAnnealingLR', T_max=90, by_epoch=True, begin=0, end=90)
+]
+
+# runtime settings
+train_cfg = dict(type='EpochBasedTrainLoop', max_epochs=90)
+
+default_hooks = dict(
+ checkpoint=dict(type='CheckpointHook', interval=10, max_keep_ckpts=3))
diff --git a/configs/mocov3/benchmarks/vit-base-p16_8xb128-linear-coslr-90e_in1k.py b/configs/mocov3/benchmarks/vit-base-p16_8xb128-linear-coslr-90e_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..91509fc05d6b6274a4bf5237d27d9e28ee365b9d
--- /dev/null
+++ b/configs/mocov3/benchmarks/vit-base-p16_8xb128-linear-coslr-90e_in1k.py
@@ -0,0 +1,45 @@
+_base_ = [
+ '../../_base_/datasets/imagenet_bs32_pil_resize.py',
+ '../../_base_/default_runtime.py',
+]
+
+# dataset settings
+train_dataloader = dict(batch_size=128)
+
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='MoCoV3ViT',
+ arch='base', # embed_dim = 768
+ img_size=224,
+ patch_size=16,
+ stop_grad_conv1=True,
+ frozen_stages=12,
+ norm_eval=True,
+ init_cfg=dict(type='Pretrained', checkpoint='', prefix='backbone.')),
+ head=dict(
+ type='VisionTransformerClsHead',
+ num_classes=1000,
+ in_channels=768,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ init_cfg=dict(type='Normal', std=0.01, layer='Linear'),
+ ))
+
+# optimizer
+optim_wrapper = dict(
+ type='OptimWrapper',
+ optimizer=dict(type='SGD', lr=12, momentum=0.9, weight_decay=0.))
+
+# learning rate scheduler
+param_scheduler = [
+ dict(type='CosineAnnealingLR', T_max=90, by_epoch=True, begin=0, end=90)
+]
+
+# runtime settings
+train_cfg = dict(type='EpochBasedTrainLoop', max_epochs=90)
+val_cfg = dict()
+test_cfg = dict()
+
+default_hooks = dict(
+ checkpoint=dict(type='CheckpointHook', interval=10, max_keep_ckpts=3))
diff --git a/configs/mocov3/benchmarks/vit-base-p16_8xb64-coslr-150e_in1k.py b/configs/mocov3/benchmarks/vit-base-p16_8xb64-coslr-150e_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..f3d074f6ed93a4f5b108c441d00b12cb51802a62
--- /dev/null
+++ b/configs/mocov3/benchmarks/vit-base-p16_8xb64-coslr-150e_in1k.py
@@ -0,0 +1,74 @@
+_base_ = [
+ '../../_base_/datasets/imagenet_bs64_swin_224.py',
+ '../../_base_/default_runtime.py',
+]
+
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='VisionTransformer',
+ arch='base',
+ img_size=224,
+ patch_size=16,
+ drop_path_rate=0.1,
+ init_cfg=dict(type='Pretrained', checkpoint='', prefix='backbone.')),
+ neck=None,
+ head=dict(
+ type='VisionTransformerClsHead',
+ num_classes=1000,
+ in_channels=768,
+ loss=dict(
+ type='LabelSmoothLoss', label_smooth_val=0.1, mode='original'),
+ init_cfg=[
+ dict(type='TruncNormal', layer='Linear', std=0.02, bias=0.),
+ dict(type='Constant', layer='LayerNorm', val=1., bias=0.),
+ ]),
+ train_cfg=dict(augments=[
+ dict(type='Mixup', alpha=0.8),
+ dict(type='CutMix', alpha=1.0)
+ ]))
+
+# optimizer
+optim_wrapper = dict(
+ type='OptimWrapper',
+ optimizer=dict(
+ type='AdamW', lr=5e-4, eps=1e-8, betas=(0.9, 0.999),
+ weight_decay=0.05),
+ clip_grad=dict(max_norm=5.0),
+ paramwise_cfg=dict(
+ norm_decay_mult=0.0,
+ bias_decay_mult=0.0,
+ custom_keys={
+ '.cls_token': dict(decay_mult=0.0),
+ '.pos_embed': dict(decay_mult=0.0)
+ }))
+
+# learning rate scheduler
+param_scheduler = [
+ dict(
+ type='LinearLR',
+ start_factor=1e-3,
+ begin=0,
+ end=5,
+ convert_to_iter_based=True),
+ dict(
+ type='CosineAnnealingLR',
+ T_max=145,
+ eta_min=1e-5,
+ by_epoch=True,
+ begin=5,
+ end=150,
+ convert_to_iter_based=True)
+]
+
+# runtime settings
+train_cfg = dict(type='EpochBasedTrainLoop', max_epochs=150)
+val_cfg = dict()
+test_cfg = dict()
+
+default_hooks = dict(
+ checkpoint=dict(type='CheckpointHook', interval=10, max_keep_ckpts=3))
+custom_hooks = [dict(type='EMAHook', momentum=4e-5, priority='ABOVE_NORMAL')]
+
+randomness = dict(seed=0)
diff --git a/configs/mocov3/benchmarks/vit-large-p16_8xb64-coslr-100e_in1k.py b/configs/mocov3/benchmarks/vit-large-p16_8xb64-coslr-100e_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..46d7f48299edfa39316eeb137c71d72d3a7955b7
--- /dev/null
+++ b/configs/mocov3/benchmarks/vit-large-p16_8xb64-coslr-100e_in1k.py
@@ -0,0 +1,74 @@
+_base_ = [
+ '../../_base_/datasets/imagenet_bs64_swin_224.py',
+ '../../_base_/default_runtime.py',
+]
+
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='VisionTransformer',
+ arch='large',
+ img_size=224,
+ patch_size=16,
+ drop_path_rate=0.5,
+ init_cfg=dict(type='Pretrained', checkpoint='', prefix='backbone.')),
+ neck=None,
+ head=dict(
+ type='VisionTransformerClsHead',
+ num_classes=1000,
+ in_channels=1024,
+ loss=dict(
+ type='LabelSmoothLoss', label_smooth_val=0.1, mode='original'),
+ init_cfg=[
+ dict(type='TruncNormal', layer='Linear', std=0.02, bias=0.),
+ dict(type='Constant', layer='LayerNorm', val=1., bias=0.),
+ ]),
+ train_cfg=dict(augments=[
+ dict(type='Mixup', alpha=0.8),
+ dict(type='CutMix', alpha=1.0)
+ ]))
+
+# optimizer
+optim_wrapper = dict(
+ type='OptimWrapper',
+ optimizer=dict(
+ type='AdamW', lr=5e-4, eps=1e-8, betas=(0.9, 0.999),
+ weight_decay=0.05),
+ clip_grad=dict(max_norm=5.0),
+ paramwise_cfg=dict(
+ norm_decay_mult=0.0,
+ bias_decay_mult=0.0,
+ custom_keys={
+ '.cls_token': dict(decay_mult=0.0),
+ '.pos_embed': dict(decay_mult=0.0)
+ }))
+
+# learning rate scheduler
+param_scheduler = [
+ dict(
+ type='LinearLR',
+ start_factor=1e-3,
+ begin=0,
+ end=5,
+ convert_to_iter_based=True),
+ dict(
+ type='CosineAnnealingLR',
+ T_max=95,
+ eta_min=1e-5,
+ by_epoch=True,
+ begin=5,
+ end=100,
+ convert_to_iter_based=True)
+]
+
+# runtime settings
+train_cfg = dict(type='EpochBasedTrainLoop', max_epochs=100)
+val_cfg = dict()
+test_cfg = dict()
+
+default_hooks = dict(
+ checkpoint=dict(type='CheckpointHook', interval=10, max_keep_ckpts=3))
+custom_hooks = [dict(type='EMAHook', momentum=4e-5, priority='ABOVE_NORMAL')]
+
+randomness = dict(seed=0)
diff --git a/configs/mocov3/benchmarks/vit-small-p16_8xb128-linear-coslr-90e_in1k.py b/configs/mocov3/benchmarks/vit-small-p16_8xb128-linear-coslr-90e_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..0c1ffa1972641194beff66d2e4ccfa31e5426fca
--- /dev/null
+++ b/configs/mocov3/benchmarks/vit-small-p16_8xb128-linear-coslr-90e_in1k.py
@@ -0,0 +1,45 @@
+_base_ = [
+ '../../_base_/datasets/imagenet_bs32_pil_resize.py',
+ '../../_base_/default_runtime.py',
+]
+
+# dataset settings
+train_dataloader = dict(batch_size=128)
+
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='MoCoV3ViT',
+ arch='mocov3-small', # embed_dim = 384
+ img_size=224,
+ patch_size=16,
+ stop_grad_conv1=True,
+ frozen_stages=12,
+ norm_eval=True,
+ init_cfg=dict(type='Pretrained', checkpoint='', prefix='backbone.')),
+ head=dict(
+ type='VisionTransformerClsHead',
+ num_classes=1000,
+ in_channels=384,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ init_cfg=dict(type='Normal', std=0.01, layer='Linear'),
+ ))
+
+# optimizer
+optim_wrapper = dict(
+ type='OptimWrapper',
+ optimizer=dict(type='SGD', lr=12, momentum=0.9, weight_decay=0.))
+
+# learning rate scheduler
+param_scheduler = [
+ dict(type='CosineAnnealingLR', T_max=90, by_epoch=True, begin=0, end=90)
+]
+
+# runtime settings
+train_cfg = dict(type='EpochBasedTrainLoop', max_epochs=90)
+val_cfg = dict()
+test_cfg = dict()
+
+default_hooks = dict(
+ checkpoint=dict(type='CheckpointHook', interval=10, max_keep_ckpts=3))
diff --git a/configs/mocov3/metafile.yml b/configs/mocov3/metafile.yml
new file mode 100644
index 0000000000000000000000000000000000000000..649d9f439e65f18b7b1613a861113425cba480ae
--- /dev/null
+++ b/configs/mocov3/metafile.yml
@@ -0,0 +1,201 @@
+Collections:
+ - Name: MoCoV3
+ Metadata:
+ Training Data: ImageNet-1k
+ Training Techniques:
+ - LARS
+ Training Resources: 32x V100 GPUs
+ Architecture:
+ - ResNet
+ - ViT
+ - MoCo
+ Paper:
+ Title: An Empirical Study of Training Self-Supervised Vision Transformers
+ URL: https://arxiv.org/abs/2104.02057
+ README: configs/mocov3/README.md
+
+Models:
+ - Name: mocov3_resnet50_8xb512-amp-coslr-100e_in1k
+ Metadata:
+ Epochs: 100
+ Batch Size: 4096
+ FLOPs: 4109364224
+ Parameters: 68012160
+ Training Data: ImageNet-1k
+ In Collection: MoCoV3
+ Results: null
+ Weights: https://download.openmmlab.com/mmselfsup/1.x/mocov3/mocov3_resnet50_8xb512-amp-coslr-100e_in1k/mocov3_resnet50_8xb512-amp-coslr-100e_in1k_20220927-f1144efa.pth
+ Config: configs/mocov3/mocov3_resnet50_8xb512-amp-coslr-100e_in1k.py
+ Downstream:
+ - resnet50_mocov3-100e-pre_8xb128-linear-coslr-90e_in1k
+ - Name: mocov3_resnet50_8xb512-amp-coslr-300e_in1k
+ Metadata:
+ Epochs: 300
+ Batch Size: 4096
+ FLOPs: 4109364224
+ Parameters: 68012160
+ Training Data: ImageNet-1k
+ In Collection: MoCoV3
+ Results: null
+ Weights: https://download.openmmlab.com/mmselfsup/1.x/mocov3/mocov3_resnet50_8xb512-amp-coslr-300e_in1k/mocov3_resnet50_8xb512-amp-coslr-300e_in1k_20220927-1e4f3304.pth
+ Config: configs/mocov3/mocov3_resnet50_8xb512-amp-coslr-300e_in1k.py
+ Downstream:
+ - resnet50_mocov3-300e-pre_8xb128-linear-coslr-90e_in1k
+ - Name: mocov3_resnet50_8xb512-amp-coslr-800e_in1k
+ Metadata:
+ Epochs: 800
+ Batch Size: 4096
+ FLOPs: 4109364224
+ Parameters: 68012160
+ Training Data: ImageNet-1k
+ In Collection: MoCoV3
+ Results: null
+ Weights: https://download.openmmlab.com/mmselfsup/1.x/mocov3/mocov3_resnet50_8xb512-amp-coslr-800e_in1k/mocov3_resnet50_8xb512-amp-coslr-800e_in1k_20220927-e043f51a.pth
+ Config: configs/mocov3/mocov3_resnet50_8xb512-amp-coslr-800e_in1k.py
+ Downstream:
+ - resnet50_mocov3-800e-pre_8xb128-linear-coslr-90e_in1k
+ - Name: resnet50_mocov3-100e-pre_8xb128-linear-coslr-90e_in1k
+ Metadata:
+ Epochs: 90
+ Batch Size: 1024
+ FLOPs: 4109464576
+ Parameters: 25557032
+ Training Data: ImageNet-1k
+ In Collection: MoCoV3
+ Results:
+ - Task: Image Classification
+ Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 69.6
+ Weights: https://download.openmmlab.com/mmselfsup/1.x/mocov3/mocov3_resnet50_8xb512-amp-coslr-100e_in1k/resnet50_linear-8xb128-coslr-90e_in1k/resnet50_linear-8xb128-coslr-90e_in1k_20220927-8f7d937e.pth
+ Config: configs/mocov3/benchmarks/resnet50_8xb128-linear-coslr-90e_in1k.py
+ - Name: resnet50_mocov3-300e-pre_8xb128-linear-coslr-90e_in1k
+ Metadata:
+ Epochs: 90
+ Batch Size: 1024
+ FLOPs: 4109464576
+ Parameters: 25557032
+ Training Data: ImageNet-1k
+ In Collection: MoCoV3
+ Results:
+ - Task: Image Classification
+ Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 72.8
+ Weights: https://download.openmmlab.com/mmselfsup/1.x/mocov3/mocov3_resnet50_8xb512-amp-coslr-300e_in1k/resnet50_linear-8xb128-coslr-90e_in1k/resnet50_linear-8xb128-coslr-90e_in1k_20220927-d21ddac2.pth
+ Config: configs/mocov3/benchmarks/resnet50_8xb128-linear-coslr-90e_in1k.py
+ - Name: resnet50_mocov3-800e-pre_8xb128-linear-coslr-90e_in1k
+ Metadata:
+ Epochs: 90
+ Batch Size: 1024
+ FLOPs: 4109464576
+ Parameters: 25557032
+ Training Data: ImageNet-1k
+ In Collection: MoCoV3
+ Results:
+ - Task: Image Classification
+ Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 74.4
+ Weights: https://download.openmmlab.com/mmselfsup/1.x/mocov3/mocov3_resnet50_8xb512-amp-coslr-800e_in1k/resnet50_linear-8xb128-coslr-90e_in1k/resnet50_linear-8xb128-coslr-90e_in1k_20220927-0e97a483.pth
+ Config: configs/mocov3/benchmarks/resnet50_8xb128-linear-coslr-90e_in1k.py
+ - Name: mocov3_vit-small-p16_16xb256-amp-coslr-300e_in1k
+ Metadata:
+ Epochs: 300
+ Batch Size: 4096
+ FLOPs: 4607954304
+ Parameters: 84266752
+ Training Data: ImageNet-1k
+ In Collection: MoCoV3
+ Results: null
+ Weights: https://download.openmmlab.com/mmselfsup/1.x/mocov3/mocov3_vit-small-p16_16xb256-amp-coslr-300e_in1k/mocov3_vit-small-p16_16xb256-amp-coslr-300e_in1k-224_20220826-08bc52f7.pth
+ Config: configs/mocov3/mocov3_vit-small-p16_16xb256-amp-coslr-300e_in1k.py
+ Downstream:
+ - vit-small-p16_mocov3-pre_8xb128-linear-coslr-90e_in1k
+ - Name: vit-small-p16_mocov3-pre_8xb128-linear-coslr-90e_in1k
+ Metadata:
+ Epochs: 90
+ Batch Size: 1024
+ FLOPs: 4607954304
+ Parameters: 22050664
+ Training Data: ImageNet-1k
+ In Collection: MoCoV3
+ Results:
+ - Task: Image Classification
+ Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 73.6
+ Weights: https://download.openmmlab.com/mmselfsup/1.x/mocov3/mocov3_vit-small-p16_16xb256-amp-coslr-300e_in1k/vit-small-p16_linear-8xb128-coslr-90e_in1k/vit-small-p16_linear-8xb128-coslr-90e_in1k_20220826-376674ef.pth
+ Config: configs/mocov3/benchmarks/vit-small-p16_8xb128-linear-coslr-90e_in1k.py
+ - Name: mocov3_vit-base-p16_16xb256-amp-coslr-300e_in1k
+ Metadata:
+ Epochs: 300
+ Batch Size: 4096
+ FLOPs: 17581972224
+ Parameters: 215678464
+ Training Data: ImageNet-1k
+ In Collection: MoCoV3
+ Results: null
+ Weights: https://download.openmmlab.com/mmselfsup/1.x/mocov3/mocov3_vit-base-p16_16xb256-amp-coslr-300e_in1k/mocov3_vit-base-p16_16xb256-amp-coslr-300e_in1k-224_20220826-25213343.pth
+ Config: configs/mocov3/mocov3_vit-base-p16_16xb256-amp-coslr-300e_in1k.py
+ Downstream:
+ - vit-base-p16_mocov3-pre_8xb128-linear-coslr-90e_in1k
+ - vit-base-p16_mocov3-pre_8xb64-coslr-150e_in1k
+ - Name: vit-base-p16_mocov3-pre_8xb64-coslr-150e_in1k
+ Metadata:
+ Epochs: 150
+ Batch Size: 512
+ FLOPs: 17581972224
+ Parameters: 86567656
+ Training Data: ImageNet-1k
+ In Collection: MoCoV3
+ Results:
+ - Task: Image Classification
+ Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 83.0
+ Weights: https://download.openmmlab.com/mmselfsup/1.x/mocov3/mocov3_vit-base-p16_16xb256-amp-coslr-300e_in1k/vit-base-p16_ft-8xb64-coslr-150e_in1k/vit-base-p16_ft-8xb64-coslr-150e_in1k_20220826-f1e6c442.pth
+ Config: configs/mocov3/benchmarks/vit-base-p16_8xb64-coslr-150e_in1k.py
+ - Name: vit-base-p16_mocov3-pre_8xb128-linear-coslr-90e_in1k
+ Metadata:
+ Epochs: 90
+ Batch Size: 1024
+ FLOPs: 17581972224
+ Parameters: 86567656
+ Training Data: ImageNet-1k
+ In Collection: MoCoV3
+ Results:
+ - Task: Image Classification
+ Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 76.9
+ Weights: https://download.openmmlab.com/mmselfsup/1.x/mocov3/mocov3_vit-base-p16_16xb256-amp-coslr-300e_in1k/vit-base-p16_linear-8xb128-coslr-90e_in1k/vit-base-p16_linear-8xb128-coslr-90e_in1k_20220826-83be7758.pth
+ Config: configs/mocov3/benchmarks/vit-base-p16_8xb128-linear-coslr-90e_in1k.py
+ - Name: mocov3_vit-large-p16_64xb64-amp-coslr-300e_in1k
+ Metadata:
+ Epochs: 300
+ Batch Size: 4096
+ FLOPs: 61603111936
+ Parameters: 652781568
+ Training Data: ImageNet-1k
+ In Collection: MoCoV3
+ Results: null
+ Weights: https://download.openmmlab.com/mmselfsup/1.x/mocov3/mocov3_vit-large-p16_64xb64-amp-coslr-300e_in1k/mocov3_vit-large-p16_64xb64-amp-coslr-300e_in1k-224_20220829-9b88a442.pth
+ Config: configs/mocov3/mocov3_vit-large-p16_64xb64-amp-coslr-300e_in1k.py
+ Downstream:
+ - vit-large-p16_mocov3-pre_8xb64-coslr-100e_in1k
+ - Name: vit-large-p16_mocov3-pre_8xb64-coslr-100e_in1k
+ Metadata:
+ Epochs: 100
+ Batch Size: 512
+ FLOPs: 61603111936
+ Parameters: 304326632
+ Training Data: ImageNet-1k
+ In Collection: MoCoV3
+ Results:
+ - Task: Image Classification
+ Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 83.7
+ Weights: https://download.openmmlab.com/mmselfsup/1.x/mocov3/mocov3_vit-large-p16_64xb64-amp-coslr-300e_in1k/vit-large-p16_ft-8xb64-coslr-100e_in1k/vit-large-p16_ft-8xb64-coslr-100e_in1k_20220829-878a2f7f.pth
+ Config: configs/mocov3/benchmarks/vit-large-p16_8xb64-coslr-100e_in1k.py
diff --git a/configs/mocov3/mocov3_resnet50_8xb512-amp-coslr-100e_in1k.py b/configs/mocov3/mocov3_resnet50_8xb512-amp-coslr-100e_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..e4eabccad9017df0cb3838f423091365c30a7e12
--- /dev/null
+++ b/configs/mocov3/mocov3_resnet50_8xb512-amp-coslr-100e_in1k.py
@@ -0,0 +1,82 @@
+_base_ = [
+ '../_base_/datasets/imagenet_bs512_mocov3.py',
+ '../_base_/default_runtime.py',
+]
+
+# model settings
+temperature = 1.0
+model = dict(
+ type='MoCoV3',
+ base_momentum=0.01, # 0.01 for 100e and 300e, 0.004 for 1000e
+ backbone=dict(
+ type='ResNet',
+ depth=50,
+ norm_cfg=dict(type='SyncBN'),
+ zero_init_residual=False),
+ neck=dict(
+ type='NonLinearNeck',
+ in_channels=2048,
+ hid_channels=4096,
+ out_channels=256,
+ num_layers=2,
+ with_bias=False,
+ with_last_bn=True,
+ with_last_bn_affine=False,
+ with_last_bias=False,
+ with_avg_pool=True),
+ head=dict(
+ type='MoCoV3Head',
+ predictor=dict(
+ type='NonLinearNeck',
+ in_channels=256,
+ hid_channels=4096,
+ out_channels=256,
+ num_layers=2,
+ with_bias=False,
+ with_last_bn=False,
+ with_last_bn_affine=False,
+ with_last_bias=False,
+ with_avg_pool=False),
+ loss=dict(type='CrossEntropyLoss', loss_weight=2 * temperature),
+ temperature=temperature))
+
+# optimizer
+optim_wrapper = dict(
+ type='AmpOptimWrapper',
+ loss_scale='dynamic',
+ optimizer=dict(type='LARS', lr=9.6, weight_decay=1e-6, momentum=0.9),
+ paramwise_cfg=dict(
+ custom_keys={
+ 'bn': dict(decay_mult=0, lars_exclude=True),
+ 'bias': dict(decay_mult=0, lars_exclude=True),
+ # bn layer in ResNet block downsample module
+ 'downsample.1': dict(decay_mult=0, lars_exclude=True),
+ }),
+)
+
+# learning rate scheduler
+param_scheduler = [
+ dict(
+ type='LinearLR',
+ start_factor=1e-4,
+ by_epoch=True,
+ begin=0,
+ end=10,
+ convert_to_iter_based=True),
+ dict(
+ type='CosineAnnealingLR',
+ T_max=90,
+ by_epoch=True,
+ begin=10,
+ end=100,
+ convert_to_iter_based=True)
+]
+
+# runtime settings
+train_cfg = dict(type='EpochBasedTrainLoop', max_epochs=100)
+# only keeps the latest 3 checkpoints
+default_hooks = dict(checkpoint=dict(max_keep_ckpts=3))
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR
+# based on the actual training batch size.
+auto_scale_lr = dict(base_batch_size=4096)
diff --git a/configs/mocov3/mocov3_resnet50_8xb512-amp-coslr-300e_in1k.py b/configs/mocov3/mocov3_resnet50_8xb512-amp-coslr-300e_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..cc0e4141032b0f8cbe82af08b653db9849013a36
--- /dev/null
+++ b/configs/mocov3/mocov3_resnet50_8xb512-amp-coslr-300e_in1k.py
@@ -0,0 +1,82 @@
+_base_ = [
+ '../_base_/datasets/imagenet_bs512_mocov3.py',
+ '../_base_/default_runtime.py',
+]
+
+# model settings
+temperature = 1.0
+model = dict(
+ type='MoCoV3',
+ base_momentum=0.01, # 0.01 for 100e and 300e, 0.004 for 1000e
+ backbone=dict(
+ type='ResNet',
+ depth=50,
+ norm_cfg=dict(type='SyncBN'),
+ zero_init_residual=False),
+ neck=dict(
+ type='NonLinearNeck',
+ in_channels=2048,
+ hid_channels=4096,
+ out_channels=256,
+ num_layers=2,
+ with_bias=False,
+ with_last_bn=True,
+ with_last_bn_affine=False,
+ with_last_bias=False,
+ with_avg_pool=True),
+ head=dict(
+ type='MoCoV3Head',
+ predictor=dict(
+ type='NonLinearNeck',
+ in_channels=256,
+ hid_channels=4096,
+ out_channels=256,
+ num_layers=2,
+ with_bias=False,
+ with_last_bn=False,
+ with_last_bn_affine=False,
+ with_last_bias=False,
+ with_avg_pool=False),
+ loss=dict(type='CrossEntropyLoss', loss_weight=2 * temperature),
+ temperature=temperature))
+
+# optimizer
+optim_wrapper = dict(
+ type='AmpOptimWrapper',
+ loss_scale='dynamic',
+ optimizer=dict(type='LARS', lr=4.8, weight_decay=1e-6, momentum=0.9),
+ paramwise_cfg=dict(
+ custom_keys={
+ 'bn': dict(decay_mult=0, lars_exclude=True),
+ 'bias': dict(decay_mult=0, lars_exclude=True),
+ # bn layer in ResNet block downsample module
+ 'downsample.1': dict(decay_mult=0, lars_exclude=True),
+ }),
+)
+
+# learning rate scheduler
+param_scheduler = [
+ dict(
+ type='LinearLR',
+ start_factor=1e-4,
+ by_epoch=True,
+ begin=0,
+ end=10,
+ convert_to_iter_based=True),
+ dict(
+ type='CosineAnnealingLR',
+ T_max=290,
+ by_epoch=True,
+ begin=10,
+ end=300,
+ convert_to_iter_based=True)
+]
+
+# runtime settings
+train_cfg = dict(type='EpochBasedTrainLoop', max_epochs=300)
+# only keeps the latest 3 checkpoints
+default_hooks = dict(checkpoint=dict(max_keep_ckpts=3))
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR
+# based on the actual training batch size.
+auto_scale_lr = dict(base_batch_size=4096)
diff --git a/configs/mocov3/mocov3_resnet50_8xb512-amp-coslr-800e_in1k.py b/configs/mocov3/mocov3_resnet50_8xb512-amp-coslr-800e_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..87f18e350ca2209fd2958a867ea6bf9887c695e5
--- /dev/null
+++ b/configs/mocov3/mocov3_resnet50_8xb512-amp-coslr-800e_in1k.py
@@ -0,0 +1,82 @@
+_base_ = [
+ '../_base_/datasets/imagenet_bs512_mocov3.py',
+ '../_base_/default_runtime.py',
+]
+
+# model settings
+temperature = 1.0
+model = dict(
+ type='MoCoV3',
+ base_momentum=0.004, # 0.01 for 100e and 300e, 0.004 for 800 and 1000e
+ backbone=dict(
+ type='ResNet',
+ depth=50,
+ norm_cfg=dict(type='SyncBN'),
+ zero_init_residual=False),
+ neck=dict(
+ type='NonLinearNeck',
+ in_channels=2048,
+ hid_channels=4096,
+ out_channels=256,
+ num_layers=2,
+ with_bias=False,
+ with_last_bn=True,
+ with_last_bn_affine=False,
+ with_last_bias=False,
+ with_avg_pool=True),
+ head=dict(
+ type='MoCoV3Head',
+ predictor=dict(
+ type='NonLinearNeck',
+ in_channels=256,
+ hid_channels=4096,
+ out_channels=256,
+ num_layers=2,
+ with_bias=False,
+ with_last_bn=False,
+ with_last_bn_affine=False,
+ with_last_bias=False,
+ with_avg_pool=False),
+ loss=dict(type='CrossEntropyLoss', loss_weight=2 * temperature),
+ temperature=temperature))
+
+# optimizer
+optim_wrapper = dict(
+ type='AmpOptimWrapper',
+ loss_scale='dynamic',
+ optimizer=dict(type='LARS', lr=4.8, weight_decay=1.5e-6, momentum=0.9),
+ paramwise_cfg=dict(
+ custom_keys={
+ 'bn': dict(decay_mult=0, lars_exclude=True),
+ 'bias': dict(decay_mult=0, lars_exclude=True),
+ # bn layer in ResNet block downsample module
+ 'downsample.1': dict(decay_mult=0, lars_exclude=True),
+ }),
+)
+
+# learning rate scheduler
+param_scheduler = [
+ dict(
+ type='LinearLR',
+ start_factor=1e-4,
+ by_epoch=True,
+ begin=0,
+ end=10,
+ convert_to_iter_based=True),
+ dict(
+ type='CosineAnnealingLR',
+ T_max=790,
+ by_epoch=True,
+ begin=10,
+ end=800,
+ convert_to_iter_based=True)
+]
+
+# runtime settings
+train_cfg = dict(type='EpochBasedTrainLoop', max_epochs=800)
+# only keeps the latest 3 checkpoints
+default_hooks = dict(checkpoint=dict(max_keep_ckpts=3))
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR
+# based on the actual training batch size.
+auto_scale_lr = dict(base_batch_size=4096)
diff --git a/configs/mocov3/mocov3_vit-base-p16_16xb256-amp-coslr-300e_in1k.py b/configs/mocov3/mocov3_vit-base-p16_16xb256-amp-coslr-300e_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..6b18fda74d646fbc6c85a0c95d70f52d91712142
--- /dev/null
+++ b/configs/mocov3/mocov3_vit-base-p16_16xb256-amp-coslr-300e_in1k.py
@@ -0,0 +1,151 @@
+_base_ = [
+ '../_base_/datasets/imagenet_bs512_mocov3.py',
+ '../_base_/default_runtime.py',
+]
+
+# dataset settings
+# the difference between the ResNet50 and ViT pipelines is the crop ratio in
+# `RandomResizedCrop`: `crop_ratio_range=(0.08, 1.)` in the ViT pipeline
+view_pipeline1 = [
+ dict(
+ type='RandomResizedCrop',
+ scale=224,
+ crop_ratio_range=(0.08, 1.),
+ backend='pillow'),
+ dict(
+ type='RandomApply',
+ transforms=[
+ dict(
+ type='ColorJitter',
+ brightness=0.4,
+ contrast=0.4,
+ saturation=0.2,
+ hue=0.1)
+ ],
+ prob=0.8),
+ dict(
+ type='RandomGrayscale',
+ prob=0.2,
+ keep_channels=True,
+ channel_weights=(0.114, 0.587, 0.2989)),
+ dict(
+ type='GaussianBlur',
+ magnitude_range=(0.1, 2.0),
+ magnitude_std='inf',
+ prob=1.),
+ dict(type='Solarize', thr=128, prob=0.),
+ dict(type='RandomFlip', prob=0.5),
+]
+view_pipeline2 = [
+ dict(
+ type='RandomResizedCrop',
+ scale=224,
+ crop_ratio_range=(0.08, 1.),
+ backend='pillow'),
+ dict(
+ type='RandomApply',
+ transforms=[
+ dict(
+ type='ColorJitter',
+ brightness=0.4,
+ contrast=0.4,
+ saturation=0.2,
+ hue=0.1)
+ ],
+ prob=0.8),
+ dict(
+ type='RandomGrayscale',
+ prob=0.2,
+ keep_channels=True,
+ channel_weights=(0.114, 0.587, 0.2989)),
+ dict(
+ type='GaussianBlur',
+ magnitude_range=(0.1, 2.0),
+ magnitude_std='inf',
+ prob=0.1),
+ dict(type='Solarize', thr=128, prob=0.2),
+ dict(type='RandomFlip', prob=0.5),
+]
+
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='MultiView',
+ num_views=[1, 1],
+ transforms=[view_pipeline1, view_pipeline2]),
+ dict(type='PackInputs')
+]
+
+train_dataloader = dict(batch_size=256, dataset=dict(pipeline=train_pipeline))
+
+# model settings
+temperature = 0.2
+model = dict(
+ type='MoCoV3',
+ base_momentum=0.01,
+ backbone=dict(
+ type='MoCoV3ViT',
+ arch='base', # embed_dim = 768
+ img_size=224,
+ patch_size=16,
+ stop_grad_conv1=True),
+ neck=dict(
+ type='NonLinearNeck',
+ in_channels=768,
+ hid_channels=4096,
+ out_channels=256,
+ num_layers=3,
+ with_bias=False,
+ with_last_bn=True,
+ with_last_bn_affine=False,
+ with_last_bias=False,
+ with_avg_pool=False),
+ head=dict(
+ type='MoCoV3Head',
+ predictor=dict(
+ type='NonLinearNeck',
+ in_channels=256,
+ hid_channels=4096,
+ out_channels=256,
+ num_layers=2,
+ with_bias=False,
+ with_last_bn=True,
+ with_last_bn_affine=False,
+ with_last_bias=False,
+ with_avg_pool=False),
+ loss=dict(type='CrossEntropyLoss', loss_weight=2 * temperature),
+ temperature=temperature))
+
+# optimizer
+optim_wrapper = dict(
+ type='AmpOptimWrapper',
+ loss_scale='dynamic',
+ optimizer=dict(type='AdamW', lr=2.4e-3, weight_decay=0.1))
+find_unused_parameters = True
+
+# learning rate scheduler
+param_scheduler = [
+ dict(
+ type='LinearLR',
+ start_factor=1e-4,
+ by_epoch=True,
+ begin=0,
+ end=40,
+ convert_to_iter_based=True),
+ dict(
+ type='CosineAnnealingLR',
+ T_max=260,
+ by_epoch=True,
+ begin=40,
+ end=300,
+ convert_to_iter_based=True)
+]
+
+# runtime settings
+train_cfg = dict(type='EpochBasedTrainLoop', max_epochs=300)
+# only keeps the latest 3 checkpoints
+default_hooks = dict(checkpoint=dict(max_keep_ckpts=3))
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR
+# based on the actual training batch size.
+auto_scale_lr = dict(base_batch_size=4096)
diff --git a/configs/mocov3/mocov3_vit-large-p16_64xb64-amp-coslr-300e_in1k.py b/configs/mocov3/mocov3_vit-large-p16_64xb64-amp-coslr-300e_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..ae31c6d8c9540640591a668be09f3cc670970283
--- /dev/null
+++ b/configs/mocov3/mocov3_vit-large-p16_64xb64-amp-coslr-300e_in1k.py
@@ -0,0 +1,154 @@
+_base_ = [
+ '../_base_/datasets/imagenet_bs512_mocov3.py',
+ '../_base_/default_runtime.py',
+]
+
+# dataset settings
+# the difference between the ResNet50 and ViT pipelines is the crop ratio in
+# `RandomResizedCrop`: `crop_ratio_range=(0.08, 1.)` in the ViT pipeline
+view_pipeline1 = [
+ dict(
+ type='RandomResizedCrop',
+ scale=224,
+ crop_ratio_range=(0.08, 1.),
+ backend='pillow'),
+ dict(
+ type='RandomApply',
+ transforms=[
+ dict(
+ type='ColorJitter',
+ brightness=0.4,
+ contrast=0.4,
+ saturation=0.2,
+ hue=0.1)
+ ],
+ prob=0.8),
+ dict(
+ type='RandomGrayscale',
+ prob=0.2,
+ keep_channels=True,
+ channel_weights=(0.114, 0.587, 0.2989)),
+ dict(
+ type='GaussianBlur',
+ magnitude_range=(0.1, 2.0),
+ magnitude_std='inf',
+ prob=1.),
+ dict(type='Solarize', thr=128, prob=0.),
+ dict(type='RandomFlip', prob=0.5),
+]
+view_pipeline2 = [
+ dict(
+ type='RandomResizedCrop',
+ scale=224,
+ crop_ratio_range=(0.08, 1.),
+ backend='pillow'),
+ dict(
+ type='RandomApply',
+ transforms=[
+ dict(
+ type='ColorJitter',
+ brightness=0.4,
+ contrast=0.4,
+ saturation=0.2,
+ hue=0.1)
+ ],
+ prob=0.8),
+ dict(
+ type='RandomGrayscale',
+ prob=0.2,
+ keep_channels=True,
+ channel_weights=(0.114, 0.587, 0.2989)),
+ dict(
+ type='GaussianBlur',
+ magnitude_range=(0.1, 2.0),
+ magnitude_std='inf',
+ prob=0.1),
+ dict(type='Solarize', thr=128, prob=0.2),
+ dict(type='RandomFlip', prob=0.5),
+]
+
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='MultiView',
+ num_views=[1, 1],
+ transforms=[view_pipeline1, view_pipeline2]),
+ dict(type='PackInputs')
+]
+
+train_dataloader = dict(batch_size=64, dataset=dict(pipeline=train_pipeline))
+
+# model settings
+temperature = 0.2
+model = dict(
+ type='MoCoV3',
+ base_momentum=0.01,
+ backbone=dict(
+ type='MoCoV3ViT',
+ arch='large', # embed_dim = 1024
+ img_size=224,
+ patch_size=16,
+ stop_grad_conv1=True),
+ neck=dict(
+ type='NonLinearNeck',
+ in_channels=1024,
+ hid_channels=4096,
+ out_channels=256,
+ num_layers=3,
+ with_bias=False,
+ with_last_bn=True,
+ with_last_bn_affine=False,
+ with_last_bias=False,
+ with_avg_pool=False),
+ head=dict(
+ type='MoCoV3Head',
+ predictor=dict(
+ type='NonLinearNeck',
+ in_channels=256,
+ hid_channels=4096,
+ out_channels=256,
+ num_layers=2,
+ with_bias=False,
+ with_last_bn=True,
+ with_last_bn_affine=False,
+ with_last_bias=False,
+ with_avg_pool=False),
+ loss=dict(type='CrossEntropyLoss', loss_weight=2 * temperature),
+ temperature=temperature))
+
+# optimizer
+optim_wrapper = dict(
+ type='AmpOptimWrapper',
+ loss_scale='dynamic',
+ clip_grad=dict(max_norm=5.0, error_if_nonfinite=False),
+ optimizer=dict(type='AdamW', lr=2.4e-3, weight_decay=0.1))
+find_unused_parameters = True
+
+# learning rate scheduler
+param_scheduler = [
+ dict(
+ type='LinearLR',
+ start_factor=1e-4,
+ by_epoch=True,
+ begin=0,
+ end=40,
+ convert_to_iter_based=True),
+ dict(
+ type='CosineAnnealingLR',
+ T_max=260,
+ by_epoch=True,
+ begin=40,
+ end=300,
+ convert_to_iter_based=True)
+]
+
+# runtime settings
+train_cfg = dict(type='EpochBasedTrainLoop', max_epochs=300)
+# only keeps the latest 3 checkpoints
+default_hooks = dict(checkpoint=dict(max_keep_ckpts=3))
+
+randomness = dict(seed=0)
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR
+# based on the actual training batch size.
+auto_scale_lr = dict(base_batch_size=4096)
diff --git a/configs/mocov3/mocov3_vit-small-p16_16xb256-amp-coslr-300e_in1k.py b/configs/mocov3/mocov3_vit-small-p16_16xb256-amp-coslr-300e_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..0d26eec77d847c5f7fdb02b20bea224b43ce393d
--- /dev/null
+++ b/configs/mocov3/mocov3_vit-small-p16_16xb256-amp-coslr-300e_in1k.py
@@ -0,0 +1,151 @@
+_base_ = [
+ '../_base_/datasets/imagenet_bs512_mocov3.py',
+ '../_base_/default_runtime.py',
+]
+
+# dataset settings
+# the difference between the ResNet50 and ViT pipelines is the crop ratio in
+# `RandomResizedCrop`: `crop_ratio_range=(0.08, 1.)` in the ViT pipeline
+view_pipeline1 = [
+ dict(
+ type='RandomResizedCrop',
+ scale=224,
+ crop_ratio_range=(0.08, 1.),
+ backend='pillow'),
+ dict(
+ type='RandomApply',
+ transforms=[
+ dict(
+ type='ColorJitter',
+ brightness=0.4,
+ contrast=0.4,
+ saturation=0.2,
+ hue=0.1)
+ ],
+ prob=0.8),
+ dict(
+ type='RandomGrayscale',
+ prob=0.2,
+ keep_channels=True,
+ channel_weights=(0.114, 0.587, 0.2989)),
+ dict(
+ type='GaussianBlur',
+ magnitude_range=(0.1, 2.0),
+ magnitude_std='inf',
+ prob=1.),
+ dict(type='Solarize', thr=128, prob=0.),
+ dict(type='RandomFlip', prob=0.5),
+]
+view_pipeline2 = [
+ dict(
+ type='RandomResizedCrop',
+ scale=224,
+ crop_ratio_range=(0.08, 1.),
+ backend='pillow'),
+ dict(
+ type='RandomApply',
+ transforms=[
+ dict(
+ type='ColorJitter',
+ brightness=0.4,
+ contrast=0.4,
+ saturation=0.2,
+ hue=0.1)
+ ],
+ prob=0.8),
+ dict(
+ type='RandomGrayscale',
+ prob=0.2,
+ keep_channels=True,
+ channel_weights=(0.114, 0.587, 0.2989)),
+ dict(
+ type='GaussianBlur',
+ magnitude_range=(0.1, 2.0),
+ magnitude_std='inf',
+ prob=0.1),
+ dict(type='Solarize', thr=128, prob=0.2),
+ dict(type='RandomFlip', prob=0.5),
+]
+
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='MultiView',
+ num_views=[1, 1],
+ transforms=[view_pipeline1, view_pipeline2]),
+ dict(type='PackInputs')
+]
+
+train_dataloader = dict(batch_size=256, dataset=dict(pipeline=train_pipeline))
+
+# model settings
+temperature = 0.2
+model = dict(
+ type='MoCoV3',
+ base_momentum=0.01,
+ backbone=dict(
+ type='MoCoV3ViT',
+ arch='mocov3-small', # embed_dim = 384
+ img_size=224,
+ patch_size=16,
+ stop_grad_conv1=True),
+ neck=dict(
+ type='NonLinearNeck',
+ in_channels=384,
+ hid_channels=4096,
+ out_channels=256,
+ num_layers=3,
+ with_bias=False,
+ with_last_bn=True,
+ with_last_bn_affine=False,
+ with_last_bias=False,
+ with_avg_pool=False),
+ head=dict(
+ type='MoCoV3Head',
+ predictor=dict(
+ type='NonLinearNeck',
+ in_channels=256,
+ hid_channels=4096,
+ out_channels=256,
+ num_layers=2,
+ with_bias=False,
+ with_last_bn=True,
+ with_last_bn_affine=False,
+ with_last_bias=False,
+ with_avg_pool=False),
+ loss=dict(type='CrossEntropyLoss', loss_weight=2 * temperature),
+ temperature=temperature))
+
+# optimizer
+optim_wrapper = dict(
+ type='AmpOptimWrapper',
+ loss_scale='dynamic',
+ optimizer=dict(type='AdamW', lr=2.4e-3, weight_decay=0.1))
+find_unused_parameters = True
+
+# learning rate scheduler
+param_scheduler = [
+ dict(
+ type='LinearLR',
+ start_factor=1e-4,
+ by_epoch=True,
+ begin=0,
+ end=40,
+ convert_to_iter_based=True),
+ dict(
+ type='CosineAnnealingLR',
+ T_max=260,
+ by_epoch=True,
+ begin=40,
+ end=300,
+ convert_to_iter_based=True)
+]
+
+# runtime settings
+train_cfg = dict(type='EpochBasedTrainLoop', max_epochs=300)
+# only keeps the latest 3 checkpoints
+default_hooks = dict(checkpoint=dict(max_keep_ckpts=3))
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR
+# based on the actual training batch size.
+auto_scale_lr = dict(base_batch_size=4096)
diff --git a/configs/mvit/README.md b/configs/mvit/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..1bf72e5e4cbb71c8ba548d9a730b0180e47fbc37
--- /dev/null
+++ b/configs/mvit/README.md
@@ -0,0 +1,85 @@
+# MViT V2
+
+> [MViTv2: Improved Multiscale Vision Transformers for Classification and Detection](http://openaccess.thecvf.com//content/CVPR2022/papers/Li_MViTv2_Improved_Multiscale_Vision_Transformers_for_Classification_and_Detection_CVPR_2022_paper.pdf)
+
+
+
+## Abstract
+
+In this paper, we study Multiscale Vision Transformers (MViTv2) as a unified architecture for image and video
+classification, as well as object detection. We present an improved version of MViT that incorporates
+decomposed relative positional embeddings and residual pooling connections. We instantiate this architecture
+in five sizes and evaluate it for ImageNet classification, COCO detection and Kinetics video recognition where
+it outperforms prior work. We further compare MViTv2s' pooling attention to window attention mechanisms where
+it outperforms the latter in accuracy/compute. Without bells-and-whistles, MViTv2 has state-of-the-art
+performance in 3 domains: 88.8% accuracy on ImageNet classification, 58.7 boxAP on COCO object detection as
+well as 86.1% on Kinetics-400 video classification.
+
+
+

+
+
+## How to use it?
+
+
+
+**Predict image**
+
+```python
+from mmpretrain import inference_model
+
+predict = inference_model('mvitv2-tiny_3rdparty_in1k', 'demo/bird.JPEG')
+print(predict['pred_class'])
+print(predict['pred_score'])
+```
+
+**Use the model**
+
+```python
+import torch
+from mmpretrain import get_model
+
+model = get_model('mvitv2-tiny_3rdparty_in1k', pretrained=True)
+inputs = torch.rand(1, 3, 224, 224)
+out = model(inputs)
+print(type(out))
+# To extract features.
+feats = model.extract_feat(inputs)
+print(type(feats))
+```
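+
+When the model is called with a plain tensor as above, it returns the raw classification logits rather than post-processed predictions. The snippet below is a small sketch of turning them into top-5 predictions; it assumes the head outputs a `(1, 1000)` logits tensor for a single 224x224 input.
+
+```python
+import torch
+from mmpretrain import get_model
+
+model = get_model('mvitv2-tiny_3rdparty_in1k', pretrained=True)
+model.eval()
+
+with torch.no_grad():
+    logits = model(torch.rand(1, 3, 224, 224))
+
+# Softmax over the class dimension, then keep the five highest-scoring classes.
+probs = torch.softmax(logits, dim=1)
+top5 = probs.topk(5, dim=1)
+print(top5.indices[0].tolist())
+print(top5.values[0].tolist())
+```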
+
+**Test Command**
+
+Prepare your dataset according to the [docs](https://mmpretrain.readthedocs.io/en/latest/user_guides/dataset_prepare.html#prepare-dataset).
+
+Test:
+
+```shell
+python tools/test.py configs/mvit/mvitv2-tiny_8xb256_in1k.py https://download.openmmlab.com/mmclassification/v0/mvit/mvitv2-tiny_3rdparty_in1k_20220722-db7beeef.pth
+```
+
+
+
+## Models and results
+
+### Image Classification on ImageNet-1k
+
+| Model | Pretrain | Params (M) | Flops (G) | Top-1 (%) | Top-5 (%) | Config | Download |
+| :----------------------------- | :----------: | :--------: | :-------: | :-------: | :-------: | :-----------------------------------: | :----------------------------------------------------------------------------------: |
+| `mvitv2-tiny_3rdparty_in1k`\* | From scratch | 24.17 | 4.70 | 82.33 | 96.15 | [config](mvitv2-tiny_8xb256_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/mvit/mvitv2-tiny_3rdparty_in1k_20220722-db7beeef.pth) |
+| `mvitv2-small_3rdparty_in1k`\* | From scratch | 34.87 | 7.00 | 83.63 | 96.51 | [config](mvitv2-small_8xb256_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/mvit/mvitv2-small_3rdparty_in1k_20220722-986bd741.pth) |
+| `mvitv2-base_3rdparty_in1k`\* | From scratch | 51.47 | 10.16 | 84.34 | 96.86 | [config](mvitv2-base_8xb256_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/mvit/mvitv2-base_3rdparty_in1k_20220722-9c4f0a17.pth) |
+| `mvitv2-large_3rdparty_in1k`\* | From scratch | 217.99 | 43.87 | 85.25 | 97.14 | [config](mvitv2-large_8xb256_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/mvit/mvitv2-large_3rdparty_in1k_20220722-2b57b983.pth) |
+
+*Models with * are converted from the [official repo](https://github.com/facebookresearch/mvit). The config files of these models are only for inference. We haven't reproduced the training results.*
+
+## Citation
+
+```bibtex
+@inproceedings{li2021improved,
+ title={MViTv2: Improved multiscale vision transformers for classification and detection},
+ author={Li, Yanghao and Wu, Chao-Yuan and Fan, Haoqi and Mangalam, Karttikeya and Xiong, Bo and Malik, Jitendra and Feichtenhofer, Christoph},
+ booktitle={CVPR},
+ year={2022}
+}
+```
diff --git a/configs/mvit/metafile.yml b/configs/mvit/metafile.yml
new file mode 100644
index 0000000000000000000000000000000000000000..c16f4f8871562637e7251eb2950bd72d3fee7df7
--- /dev/null
+++ b/configs/mvit/metafile.yml
@@ -0,0 +1,95 @@
+Collections:
+ - Name: MViT V2
+ Metadata:
+ Architecture:
+ - Attention Dropout
+ - Convolution
+ - Dense Connections
+ - GELU
+ - Layer Normalization
+ - Scaled Dot-Product Attention
+ - Attention Pooling
+ Paper:
+ URL: http://openaccess.thecvf.com//content/CVPR2022/papers/Li_MViTv2_Improved_Multiscale_Vision_Transformers_for_Classification_and_Detection_CVPR_2022_paper.pdf
+ Title: 'MViTv2: Improved Multiscale Vision Transformers for Classification and Detection'
+ README: configs/mvit/README.md
+ Code:
+ URL: https://github.com/open-mmlab/mmpretrain/blob/v0.24.0/mmcls/models/backbones/mvit.py
+ Version: v0.24.0
+
+Models:
+ - Name: mvitv2-tiny_3rdparty_in1k
+ In Collection: MViT V2
+ Metadata:
+ FLOPs: 4703510768
+ Parameters: 24173320
+ Training Data:
+ - ImageNet-1k
+ Results:
+ - Dataset: ImageNet-1k
+ Task: Image Classification
+ Metrics:
+ Top 1 Accuracy: 82.33
+ Top 5 Accuracy: 96.15
+ Weights: https://download.openmmlab.com/mmclassification/v0/mvit/mvitv2-tiny_3rdparty_in1k_20220722-db7beeef.pth
+ Converted From:
+ Weights: https://dl.fbaipublicfiles.com/mvit/mvitv2_models/MViTv2_T_in1k.pyth
+ Code: https://github.com/facebookresearch/mvit
+ Config: configs/mvit/mvitv2-tiny_8xb256_in1k.py
+
+ - Name: mvitv2-small_3rdparty_in1k
+ In Collection: MViT V2
+ Metadata:
+ FLOPs: 6997555136
+ Parameters: 34870216
+ Training Data:
+ - ImageNet-1k
+ Results:
+ - Dataset: ImageNet-1k
+ Task: Image Classification
+ Metrics:
+ Top 1 Accuracy: 83.63
+ Top 5 Accuracy: 96.51
+ Weights: https://download.openmmlab.com/mmclassification/v0/mvit/mvitv2-small_3rdparty_in1k_20220722-986bd741.pth
+ Converted From:
+ Weights: https://dl.fbaipublicfiles.com/mvit/mvitv2_models/MViTv2_S_in1k.pyth
+ Code: https://github.com/facebookresearch/mvit
+ Config: configs/mvit/mvitv2-small_8xb256_in1k.py
+
+ - Name: mvitv2-base_3rdparty_in1k
+ In Collection: MViT V2
+ Metadata:
+ FLOPs: 10157964400
+ Parameters: 51472744
+ Training Data:
+ - ImageNet-1k
+ Results:
+ - Dataset: ImageNet-1k
+ Task: Image Classification
+ Metrics:
+ Top 1 Accuracy: 84.34
+ Top 5 Accuracy: 96.86
+ Weights: https://download.openmmlab.com/mmclassification/v0/mvit/mvitv2-base_3rdparty_in1k_20220722-9c4f0a17.pth
+ Converted From:
+ Weights: https://dl.fbaipublicfiles.com/mvit/mvitv2_models/MViTv2_B_in1k.pyth
+ Code: https://github.com/facebookresearch/mvit
+ Config: configs/mvit/mvitv2-base_8xb256_in1k.py
+
+ - Name: mvitv2-large_3rdparty_in1k
+ In Collection: MViT V2
+ Metadata:
+ FLOPs: 43868151412
+ Parameters: 217992952
+ Training Data:
+ - ImageNet-1k
+ Results:
+ - Dataset: ImageNet-1k
+ Task: Image Classification
+ Metrics:
+ Top 1 Accuracy: 85.25
+ Top 5 Accuracy: 97.14
+ Weights: https://download.openmmlab.com/mmclassification/v0/mvit/mvitv2-large_3rdparty_in1k_20220722-2b57b983.pth
+ Converted From:
+ Weights: https://dl.fbaipublicfiles.com/mvit/mvitv2_models/MViTv2_L_in1k.pyth
+ Code: https://github.com/facebookresearch/mvit
+ Config: configs/mvit/mvitv2-large_8xb256_in1k.py
diff --git a/configs/mvit/mvitv2-base_8xb256_in1k.py b/configs/mvit/mvitv2-base_8xb256_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..ee3ec11e2bc9873e21b58f0e3e940b5d9fc1e4d5
--- /dev/null
+++ b/configs/mvit/mvitv2-base_8xb256_in1k.py
@@ -0,0 +1,43 @@
+_base_ = [
+ '../_base_/models/mvit/mvitv2-base.py',
+ '../_base_/datasets/imagenet_bs64_swin_224.py',
+ '../_base_/schedules/imagenet_bs1024_adamw_swin.py',
+ '../_base_/default_runtime.py'
+]
+
+# dataset settings
+train_dataloader = dict(batch_size=256)
+val_dataloader = dict(batch_size=256)
+test_dataloader = dict(batch_size=256)
+
+# schedule settings
+optim_wrapper = dict(
+ optimizer=dict(lr=2.5e-4),
+ paramwise_cfg=dict(
+ norm_decay_mult=0.0,
+ bias_decay_mult=0.0,
+ custom_keys={
+ '.pos_embed': dict(decay_mult=0.0),
+ '.rel_pos_h': dict(decay_mult=0.0),
+ '.rel_pos_w': dict(decay_mult=0.0)
+ }),
+ clip_grad=dict(max_norm=1.0),
+)
+
+# learning policy
+param_scheduler = [
+ # warm up learning rate scheduler
+ dict(
+ type='LinearLR',
+ start_factor=1e-3,
+ by_epoch=True,
+ end=70,
+ # update by iter
+ convert_to_iter_based=True),
+ # main learning rate scheduler
+ dict(type='CosineAnnealingLR', eta_min=1e-5, by_epoch=True, begin=70)
+]
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR,
+# based on the actual training batch size.
+auto_scale_lr = dict(base_batch_size=2048)
diff --git a/configs/mvit/mvitv2-large_8xb256_in1k.py b/configs/mvit/mvitv2-large_8xb256_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..eacddf96e9f9ab6b0da3f3edec973d69d41d1c9b
--- /dev/null
+++ b/configs/mvit/mvitv2-large_8xb256_in1k.py
@@ -0,0 +1,43 @@
+_base_ = [
+ '../_base_/models/mvit/mvitv2-large.py',
+ '../_base_/datasets/imagenet_bs64_swin_224.py',
+ '../_base_/schedules/imagenet_bs2048_AdamW.py',
+ '../_base_/default_runtime.py'
+]
+
+# dataset settings
+train_dataloader = dict(batch_size=256)
+val_dataloader = dict(batch_size=256)
+test_dataloader = dict(batch_size=256)
+
+# schedule settings
+optim_wrapper = dict(
+ optimizer=dict(lr=2.5e-4),
+ paramwise_cfg=dict(
+ norm_decay_mult=0.0,
+ bias_decay_mult=0.0,
+ custom_keys={
+ '.pos_embed': dict(decay_mult=0.0),
+ '.rel_pos_h': dict(decay_mult=0.0),
+ '.rel_pos_w': dict(decay_mult=0.0)
+ }),
+ clip_grad=dict(max_norm=1.0),
+)
+
+# learning policy
+param_scheduler = [
+ # warm up learning rate scheduler
+ dict(
+ type='LinearLR',
+ start_factor=1e-3,
+ by_epoch=True,
+ end=70,
+ # update by iter
+ convert_to_iter_based=True),
+ # main learning rate scheduler
+ dict(type='CosineAnnealingLR', eta_min=1e-5, by_epoch=True, begin=70)
+]
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR,
+# based on the actual training batch size.
+auto_scale_lr = dict(base_batch_size=2048)
diff --git a/configs/mvit/mvitv2-small_8xb256_in1k.py b/configs/mvit/mvitv2-small_8xb256_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..74cfd0a357a7ab773f5ac27404bbc0b78b06f901
--- /dev/null
+++ b/configs/mvit/mvitv2-small_8xb256_in1k.py
@@ -0,0 +1,43 @@
+_base_ = [
+ '../_base_/models/mvit/mvitv2-small.py',
+ '../_base_/datasets/imagenet_bs64_swin_224.py',
+ '../_base_/schedules/imagenet_bs2048_AdamW.py',
+ '../_base_/default_runtime.py'
+]
+
+# dataset settings
+train_dataloader = dict(batch_size=256)
+val_dataloader = dict(batch_size=256)
+test_dataloader = dict(batch_size=256)
+
+# schedule settings
+optim_wrapper = dict(
+ optimizer=dict(lr=2.5e-4),
+ paramwise_cfg=dict(
+ norm_decay_mult=0.0,
+ bias_decay_mult=0.0,
+ custom_keys={
+ '.pos_embed': dict(decay_mult=0.0),
+ '.rel_pos_h': dict(decay_mult=0.0),
+ '.rel_pos_w': dict(decay_mult=0.0)
+ }),
+ clip_grad=dict(max_norm=1.0),
+)
+
+# learning policy
+param_scheduler = [
+ # warm up learning rate scheduler
+ dict(
+ type='LinearLR',
+ start_factor=1e-3,
+ by_epoch=True,
+ end=70,
+ # update by iter
+ convert_to_iter_based=True),
+ # main learning rate scheduler
+ dict(type='CosineAnnealingLR', eta_min=1e-5, by_epoch=True, begin=70)
+]
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR,
+# based on the actual training batch size.
+auto_scale_lr = dict(base_batch_size=2048)
diff --git a/configs/mvit/mvitv2-tiny_8xb256_in1k.py b/configs/mvit/mvitv2-tiny_8xb256_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..4e563a2c9840fe27ae7ba4425976b540b40d21bc
--- /dev/null
+++ b/configs/mvit/mvitv2-tiny_8xb256_in1k.py
@@ -0,0 +1,43 @@
+_base_ = [
+ '../_base_/models/mvit/mvitv2-tiny.py',
+ '../_base_/datasets/imagenet_bs64_swin_224.py',
+ '../_base_/schedules/imagenet_bs2048_AdamW.py',
+ '../_base_/default_runtime.py'
+]
+
+# dataset settings
+train_dataloader = dict(batch_size=256)
+val_dataloader = dict(batch_size=256)
+test_dataloader = dict(batch_size=256)
+
+# schedule settings
+optim_wrapper = dict(
+ optimizer=dict(lr=2.5e-4),
+ paramwise_cfg=dict(
+ norm_decay_mult=0.0,
+ bias_decay_mult=0.0,
+ custom_keys={
+ '.pos_embed': dict(decay_mult=0.0),
+ '.rel_pos_h': dict(decay_mult=0.0),
+ '.rel_pos_w': dict(decay_mult=0.0)
+ }),
+ clip_grad=dict(max_norm=1.0),
+)
+
+# learning policy
+param_scheduler = [
+ # warm up learning rate scheduler
+ dict(
+ type='LinearLR',
+ start_factor=1e-3,
+ by_epoch=True,
+ end=70,
+ # update by iter
+ convert_to_iter_based=True),
+ # main learning rate scheduler
+ dict(type='CosineAnnealingLR', eta_min=1e-5, by_epoch=True, begin=70)
+]
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR,
+# based on the actual training batch size.
+auto_scale_lr = dict(base_batch_size=2048)
diff --git a/configs/ofa/README.md b/configs/ofa/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..22e20f8bd85d41ed7faa1794273aeec002311f17
--- /dev/null
+++ b/configs/ofa/README.md
@@ -0,0 +1,88 @@
+# OFA
+
+> [OFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework](https://arxiv.org/abs/2202.03052)
+
+
+
+## Abstract
+
+In this work, we pursue a unified paradigm for multimodal pretraining to break the scaffolds of complex task/modality-specific customization. We propose OFA, a Task-Agnostic and Modality-Agnostic framework that supports Task Comprehensiveness. OFA unifies a diverse set of cross-modal and unimodal tasks, including image generation, visual grounding, image captioning, image classification, language modeling, etc., in a simple sequence-to-sequence learning framework. OFA follows the instruction-based learning in both pretraining and finetuning stages, requiring no extra task-specific layers for downstream tasks. In comparison with the recent state-of-the-art vision & language models that rely on extremely large cross-modal datasets, OFA is pretrained on only 20M publicly available image-text pairs. Despite its simplicity and relatively small-scale training data, OFA achieves new SOTAs in a series of cross-modal tasks while attaining highly competitive performances on uni-modal tasks. Our further analysis indicates that OFA can also effectively transfer to unseen tasks and unseen domains.
+
+
+

+
+
+## How to use it?
+
+
+
+**Use the model**
+
+```python
+from mmpretrain import inference_model
+
+result = inference_model('ofa-base_3rdparty-finetuned_caption', 'demo/cat-dog.png')
+print(result)
+# {'pred_caption': 'a dog and a kitten sitting next to each other'}
+```
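+
+The VQA checkpoints listed in the tables below can be queried through the same `inference_model` helper. The call below is a hedged sketch: it assumes the VQA inferencer accepts the question as the second positional argument, so adjust it if the interface of your installed version differs.
+
+```python
+from mmpretrain import inference_model
+
+# Ask the zero-shot VQA model a free-form question about the demo image.
+result = inference_model(
+    'ofa-base_3rdparty-zeroshot_vqa',
+    'demo/cat-dog.png',
+    'what animals are in the picture?')
+print(result)
+```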
+
+**Test Command**
+
+Prepare your dataset according to the [docs](https://mmpretrain.readthedocs.io/en/latest/user_guides/dataset_prepare.html#prepare-dataset).
+
+Test:
+
+```shell
+python tools/test.py configs/ofa/ofa-base_finetuned_refcoco.py https://download.openmmlab.com/mmclassification/v1/ofa/ofa-base_3rdparty_refcoco_20230418-2797d3ab.pth
+```
+
+
+
+## Models and results
+
+### Image Caption on COCO
+
+| Model | Params (M) | BLEU-4 | CIDER | Config | Download |
+| :-------------------------------------- | :--------: | :----: | :----: | :-------------------------------------: | :--------------------------------------------------------------------------------------------------: |
+| `ofa-base_3rdparty-finetuned_caption`\* | 182.24 | 42.64 | 144.50 | [config](ofa-base_finetuned_caption.py) | [model](https://download.openmmlab.com/mmclassification/v1/ofa/ofa-base_3rdparty_coco-caption_20230418-de18914e.pth) |
+
+*Models with * are converted from the [official repo](https://github.com/OFA-Sys/OFA). The config files of these models are only for inference. We haven't reproduced the training results.*
+
+### Visual Grounding on RefCOCO
+
+| Model | Params (M) | Accuracy (testA) | Accuracy (testB) | Config | Download |
+| :-------------------------------------- | :--------: | :--------------: | :--------------: | :-------------------------------------: | :------------------------------------------------------------------------------: |
+| `ofa-base_3rdparty-finetuned_refcoco`\* | 182.24 | 90.49 | 83.63 | [config](ofa-base_finetuned_refcoco.py) | [model](https://download.openmmlab.com/mmclassification/v1/ofa/ofa-base_3rdparty_refcoco_20230418-2797d3ab.pth) |
+
+*Models with * are converted from the [official repo](https://github.com/OFA-Sys/OFA). The config files of these models are only for inference. We haven't reproduced the training results.*
+
+### Visual Question Answering on VQAv2
+
+| Model | Params (M) | Accuracy | Config | Download |
+| :---------------------------------- | :--------: | :------: | :---------------------------------: | :--------------------------------------------------------------------------------------------------------------: |
+| `ofa-base_3rdparty-finetuned_vqa`\* | 182.24 | 78.00 | [config](ofa-base_finetuned_vqa.py) | [model](https://download.openmmlab.com/mmclassification/v1/ofa/ofa-base_3rdparty_coco-vqa_20230418-f38539a5.pth) |
+| `ofa-base_3rdparty-zeroshot_vqa`\* | 182.24 | 58.32 | [config](ofa-base_zeroshot_vqa.py) | [model](https://download.openmmlab.com/mmclassification/v1/ofa/ofa-base_3rdparty_pretrain_20230418-dccfc07f.pth) |
+
+*Models with * are converted from the [official repo](https://github.com/OFA-Sys/OFA). The config files of these models are only for inference. We haven't reproduced the training results.*
+
+## Citation
+
+```bibtex
+@article{wang2022ofa,
+ author = {Peng Wang and
+ An Yang and
+ Rui Men and
+ Junyang Lin and
+ Shuai Bai and
+ Zhikang Li and
+ Jianxin Ma and
+ Chang Zhou and
+ Jingren Zhou and
+ Hongxia Yang},
+ title = {OFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence
+ Learning Framework},
+ journal = {CoRR},
+ volume = {abs/2202.03052},
+ year = {2022}
+}
+```
diff --git a/configs/ofa/metafile.yml b/configs/ofa/metafile.yml
new file mode 100644
index 0000000000000000000000000000000000000000..9c4b3ebf72b766ae64b89bc22ab60c616159af1d
--- /dev/null
+++ b/configs/ofa/metafile.yml
@@ -0,0 +1,89 @@
+Collections:
+ - Name: OFA
+ Metadata:
+ Architecture:
+ - ResNet
+ - Transformer
+ Training Data:
+ - CC12M
+ - CC3M
+ - SBU
+ - COCO
+ - VG
+ - VQAv2
+ - GQA
+ - RefCOCO
+ - OpenImages
+ - Object365
+ - YFCC100M
+ - ImageNet-21K
+ - Pile
+ Paper:
+ Title: 'OFA: Unifying Architectures, Tasks, and Modalities Through a Simple
+ Sequence-to-Sequence Learning Framework'
+ URL: https://arxiv.org/abs/2202.03052
+ README: configs/ofa/README.md
+
+Models:
+ - Name: ofa-base_3rdparty-finetuned_refcoco
+ Metadata:
+ FLOPs: null
+ Parameters: 182238536
+ In Collection: OFA
+ Results:
+ - Task: Visual Grounding
+ Dataset: RefCOCO
+ Metrics:
+ Accuracy (testA): 90.49
+ Accuracy (testB): 83.63
+ Weights: https://download.openmmlab.com/mmclassification/v1/ofa/ofa-base_3rdparty_refcoco_20230418-2797d3ab.pth
+ Config: configs/ofa/ofa-base_finetuned_refcoco.py
+ Converted From:
+ Weights: https://ofa-beijing.oss-cn-beijing.aliyuncs.com/checkpoints/refcoco_base_best.pt
+ Code: https://github.com/OFA-Sys/OFA
+ - Name: ofa-base_3rdparty-finetuned_vqa
+ Metadata:
+ FLOPs: null
+ Parameters: 182238536
+ In Collection: OFA
+ Results:
+ - Task: Visual Question Answering
+ Dataset: VQAv2
+ Metrics:
+ Accuracy: 78.00 # Report from the official repo
+ Weights: https://download.openmmlab.com/mmclassification/v1/ofa/ofa-base_3rdparty_coco-vqa_20230418-f38539a5.pth
+ Config: configs/ofa/ofa-base_finetuned_vqa.py
+ Converted From:
+ Weights: https://ofa-beijing.oss-cn-beijing.aliyuncs.com/checkpoints/vqa_large_best.pt
+ Code: https://github.com/OFA-Sys/OFA
+ - Name: ofa-base_3rdparty-finetuned_caption
+ Metadata:
+ FLOPs: null
+ Parameters: 182238536
+ In Collection: OFA
+ Results:
+ - Task: Image Caption
+ Dataset: COCO
+ Metrics:
+ BLEU-4: 42.64
+ CIDER: 144.50
+ Weights: https://download.openmmlab.com/mmclassification/v1/ofa/ofa-base_3rdparty_coco-caption_20230418-de18914e.pth
+ Config: configs/ofa/ofa-base_finetuned_caption.py
+ Converted From:
+ Weights: https://ofa-beijing.oss-cn-beijing.aliyuncs.com/checkpoints/caption_base_best.pt
+ Code: https://github.com/OFA-Sys/OFA
+ - Name: ofa-base_3rdparty-zeroshot_vqa
+ Metadata:
+ FLOPs: null
+ Parameters: 182238536
+ In Collection: OFA
+ Results:
+ - Task: Visual Question Answering
+ Dataset: VQAv2
+ Metrics:
+ Accuracy: 58.32
+ Weights: https://download.openmmlab.com/mmclassification/v1/ofa/ofa-base_3rdparty_pretrain_20230418-dccfc07f.pth
+ Config: configs/ofa/ofa-base_zeroshot_vqa.py
+ Converted From:
+ Weights: https://ofa-beijing.oss-cn-beijing.aliyuncs.com/checkpoints/ofa_base.pt
+ Code: https://github.com/OFA-Sys/OFA
diff --git a/configs/ofa/ofa-base_finetuned_caption.py b/configs/ofa/ofa-base_finetuned_caption.py
new file mode 100644
index 0000000000000000000000000000000000000000..45efff06ec8ebd5ecc85dbdf15834819fb07bb38
--- /dev/null
+++ b/configs/ofa/ofa-base_finetuned_caption.py
@@ -0,0 +1,41 @@
+_base_ = [
+ '../_base_/datasets/coco_caption.py',
+ '../_base_/default_runtime.py',
+]
+
+# model settings
+model = dict(
+ type='OFA',
+ task='caption',
+ vocab_size=59457,
+ embedding_dim=768,
+ encoder_cfg=dict(
+ embed_images=dict(type='OFAResNet', depth=101),
+ num_layers=6,
+ ),
+ decoder_cfg=dict(num_layers=6),
+ generation_cfg=dict(use_cache=True),
+ tokenizer=dict(type='OFATokenizer', name_or_path='OFA-Sys/OFA-base'),
+)
+
+# data settings
+data_preprocessor = dict(
+ type='MultiModalDataPreprocessor',
+ mean=[127.5, 127.5, 127.5],
+ std=[127.5, 127.5, 127.5],
+ to_rgb=True,
+)
+
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='Resize', scale=(480, 480)),
+ dict(type='PackInputs', meta_keys=('image_id', )),
+]
+
+train_dataloader = None
+test_dataloader = dict(dataset=dict(pipeline=test_pipeline))
+
+# schedule settings
+train_cfg = None
+val_cfg = dict()
+test_cfg = dict()
diff --git a/configs/ofa/ofa-base_finetuned_refcoco.py b/configs/ofa/ofa-base_finetuned_refcoco.py
new file mode 100644
index 0000000000000000000000000000000000000000..5a7435dbd467ed71b3ee6a4e2c6020083c180729
--- /dev/null
+++ b/configs/ofa/ofa-base_finetuned_refcoco.py
@@ -0,0 +1,45 @@
+_base_ = [
+ '../_base_/datasets/refcoco.py',
+ '../_base_/default_runtime.py',
+]
+
+# model settings
+model = dict(
+ type='OFA',
+ task='refcoco',
+ vocab_size=59457,
+ embedding_dim=768,
+ encoder_cfg=dict(
+ embed_images=dict(type='OFAResNet', depth=101),
+ num_layers=6,
+ ),
+ decoder_cfg=dict(num_layers=6),
+ generation_cfg=dict(use_cache=True),
+ tokenizer=dict(type='OFATokenizer', name_or_path='OFA-Sys/OFA-base'),
+)
+
+# data settings
+data_preprocessor = dict(
+ type='MultiModalDataPreprocessor',
+ mean=[127.5, 127.5, 127.5],
+ std=[127.5, 127.5, 127.5],
+ to_rgb=True,
+)
+
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='Resize', scale=(512, 512)),
+ dict(
+ type='PackInputs',
+ algorithm_keys=['text', 'gt_bboxes'],
+ meta_keys=['image_id', 'scale_factor'],
+ ),
+]
+
+train_dataloader = None
+test_dataloader = dict(dataset=dict(pipeline=test_pipeline))
+
+# schedule settings
+train_cfg = None
+val_cfg = dict()
+test_cfg = dict()
diff --git a/configs/ofa/ofa-base_finetuned_vqa.py b/configs/ofa/ofa-base_finetuned_vqa.py
new file mode 100644
index 0000000000000000000000000000000000000000..b120d091e5b9d1b38a3e0ebd1466f0fed9d0f611
--- /dev/null
+++ b/configs/ofa/ofa-base_finetuned_vqa.py
@@ -0,0 +1,64 @@
+_base_ = [
+ '../_base_/datasets/coco_vqa.py',
+ '../_base_/default_runtime.py',
+]
+
+ANS2LABEL = 'https://ofa-beijing.oss-cn-beijing.aliyuncs.com/datasets/vqa_data/trainval_ans2label.pkl' # noqa: E501
+
+# model settings
+model = dict(
+ type='OFA',
+ task='vqa',
+ vocab_size=59457,
+ embedding_dim=768,
+ ans2label=ANS2LABEL,
+ encoder_cfg=dict(
+ embed_images=dict(type='OFAResNet', depth=101),
+ num_layers=6,
+ num_heads=12,
+ ),
+ decoder_cfg=dict(
+ num_layers=6,
+ num_heads=12,
+ ),
+ generation_cfg=dict(
+ num_beams=5,
+ max_new_tokens=200,
+ length_penalty=0., # VQA doesn't require longer answer.
+ use_cache=True,
+ ),
+ tokenizer=dict(type='OFATokenizer', name_or_path='OFA-Sys/OFA-base'),
+)
+
+# data settings
+data_preprocessor = dict(
+ type='MultiModalDataPreprocessor',
+ mean=[127.5, 127.5, 127.5],
+ std=[127.5, 127.5, 127.5],
+ to_rgb=True,
+)
+
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='Resize',
+ scale=(480, 480),
+ interpolation='bicubic',
+ backend='pillow'),
+ dict(type='OFAAddObjects'),
+ dict(
+ type='PackInputs',
+ algorithm_keys=[
+ 'question', 'gt_answer', 'gt_answer_weight', 'decoder_prompt'
+ ],
+ meta_keys=['question_id', 'image_id'],
+ ),
+]
+
+train_dataloader = None # Eval only
+test_dataloader = dict(dataset=dict(pipeline=test_pipeline))
+
+# schedule settings
+train_cfg = None
+val_cfg = dict()
+test_cfg = dict()
diff --git a/configs/ofa/ofa-base_zeroshot_vqa.py b/configs/ofa/ofa-base_zeroshot_vqa.py
new file mode 100644
index 0000000000000000000000000000000000000000..9890cdd2a48484102877e3f3a946b73fefa6dbae
--- /dev/null
+++ b/configs/ofa/ofa-base_zeroshot_vqa.py
@@ -0,0 +1,42 @@
+_base_ = [
+ '../_base_/datasets/coco_vqa.py',
+ '../_base_/default_runtime.py',
+]
+
+# model settings
+model = dict(
+ type='OFA',
+ task='vqa',
+ vocab_size=59457,
+ embedding_dim=768,
+ encoder_cfg=dict(
+ embed_images=dict(type='OFAResNet', depth=101),
+ num_layers=6,
+ num_heads=12,
+ ),
+ decoder_cfg=dict(
+ num_layers=6,
+ num_heads=12,
+ ),
+ generation_cfg=dict(
+ num_beams=20,
+ max_new_tokens=200,
+ length_penalty=0., # VQA doesn't require longer answer.
+ use_cache=True,
+ ),
+ tokenizer=dict(type='OFATokenizer', name_or_path='OFA-Sys/OFA-base'),
+)
+
+# data settings
+data_preprocessor = dict(
+ mean=[127.5, 127.5, 127.5],
+ std=[127.5, 127.5, 127.5],
+ to_rgb=True,
+)
+
+train_dataloader = None # Eval only
+
+# schedule settings
+train_cfg = None
+val_cfg = dict()
+test_cfg = dict()
diff --git a/configs/ofa/ofa-large_zeroshot_vqa.py b/configs/ofa/ofa-large_zeroshot_vqa.py
new file mode 100644
index 0000000000000000000000000000000000000000..8b47121127c21baabbb963ccc8407a27d823cec1
--- /dev/null
+++ b/configs/ofa/ofa-large_zeroshot_vqa.py
@@ -0,0 +1,43 @@
+_base_ = [
+ '../_base_/datasets/coco_vqa.py',
+ '../_base_/default_runtime.py',
+]
+
+# model settings
+model = dict(
+ type='OFA',
+ task='vqa',
+ vocab_size=59457,
+ embedding_dim=1024,
+ encoder_cfg=dict(
+ embed_images=dict(type='OFAResNet', depth=152),
+ num_layers=12,
+ num_heads=16,
+ ),
+ decoder_cfg=dict(
+ num_layers=12,
+ num_heads=16,
+ ),
+ generation_cfg=dict(
+ num_beams=20,
+ max_new_tokens=200,
+ length_penalty=0., # VQA doesn't require longer answer.
+ use_cache=True,
+ ),
+ tokenizer=dict(type='OFATokenizer', name_or_path='OFA-Sys/OFA-large'),
+)
+
+# data settings
+data_preprocessor = dict(
+ type='MultiModalDataPreprocessor',
+ mean=[127.5, 127.5, 127.5],
+ std=[127.5, 127.5, 127.5],
+ to_rgb=True,
+)
+
+train_dataloader = None # Eval only
+
+# schedule settings
+train_cfg = None
+val_cfg = dict()
+test_cfg = dict()
diff --git a/configs/otter/README.md b/configs/otter/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..18a84684f84e61a664c0742ff96ecaa440f2633b
--- /dev/null
+++ b/configs/otter/README.md
@@ -0,0 +1,79 @@
+# Otter
+
+> [Otter: A Multi-Modal Model with In-Context Instruction Tuning](https://arxiv.org/abs/2305.03726)
+
+
+
+## Abstract
+
+Large language models (LLMs) have demonstrated significant universal capabilities as few/zero-shot learners in various tasks due to their pre-training on vast amounts of text data, as exemplified by GPT-3, which boosted to InstructGPT and ChatGPT, effectively following natural language instructions to accomplish real-world tasks. In this paper, we propose to introduce instruction tuning into multi-modal models, motivated by the Flamingo model's upstream interleaved format pretraining dataset. We adopt a similar approach to construct our MultI-Modal In-Context Instruction Tuning (MIMIC-IT) dataset. We then introduce Otter, a multi-modal model based on OpenFlamingo (open-sourced version of DeepMind's Flamingo), trained on MIMIC-IT and showcasing improved instruction-following ability and in-context learning. We also optimize OpenFlamingo's implementation for researchers, democratizing the required training resources from 1$\times$ A100 GPU to 4$\times$ RTX-3090 GPUs, and integrate both OpenFlamingo and Otter into Huggingface Transformers for more researchers to incorporate the models into their customized training and inference pipelines.
+
+
+
+
+
+
+## How to use it?
+
+
+
+**Use the model**
+
+```python
+import torch
+from mmpretrain import get_model, inference_model
+
+model = get_model('otter-9b_3rdparty_caption', pretrained=True, device='cuda', generation_cfg=dict(max_new_tokens=50))
+out = inference_model(model, 'demo/cat-dog.png')
+print(out)
+# {'pred_caption': 'The image features two adorable small puppies sitting next to each other on the grass. One puppy is brown and white, while the other is tan and white. They appear to be relaxing outdoors, enjoying each other'}
+```
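+
+The VQA variant can be used in the same way. The snippet below is a sketch under the assumption that `inference_model` forwards the question to the visual question answering inferencer as the argument after the image.
+
+```python
+from mmpretrain import get_model, inference_model
+
+# Hypothetical VQA usage with the converted Otter checkpoint; the question is
+# assumed to be forwarded to the VQA inferencer after the image path.
+model = get_model('otter-9b_3rdparty_vqa', pretrained=True, device='cuda')
+result = inference_model(model, 'demo/cat-dog.png', 'What animals are shown?')
+print(result)
+```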
+
+**Test Command**
+
+Prepare your dataset according to the [docs](https://mmpretrain.readthedocs.io/en/latest/user_guides/dataset_prepare.html#prepare-dataset).
+
+Test:
+
+```shell
+python tools/test.py configs/otter/otter-9b_caption.py https://download.openmmlab.com/mmclassification/v1/otter/otter-9b-adapter_20230613-51c5be8d.pth
+```
+
+
+
+## Models and results
+
+### Image Caption on COCO
+
+| Model | Params (M) | BLEU-4 | CIDER | Config | Download |
+| :---------------------------- | :--------: | :------: | :------: | :---------------------------: | :------------------------------------------------------------------------------------------------------: |
+| `otter-9b_3rdparty_caption`\* | 8220.45 | Upcoming | Upcoming | [config](otter-9b_caption.py) | [model](https://download.openmmlab.com/mmclassification/v1/otter/otter-9b-adapter_20230613-51c5be8d.pth) |
+
+*Models with * are converted from the [official repo](https://github.com/Luodian/Otter/tree/main). The config files of these models are only for inference. We haven't reproduced the training results.*
+
+### Visual Question Answering on VQAv2
+
+| Model | Params (M) | Accuracy | Config | Download |
+| :------------------------ | :--------: | :------: | :-----------------------: | :------------------------------------------------------------------------------------------------------: |
+| `otter-9b_3rdparty_vqa`\* | 8220.45 | Upcoming | [config](otter-9b_vqa.py) | [model](https://download.openmmlab.com/mmclassification/v1/otter/otter-9b-adapter_20230613-51c5be8d.pth) |
+
+*Models with * are converted from the [official repo](https://github.com/Luodian/Otter/tree/main). The config files of these models are only for inference. We haven't reproduced the training results.*
+
+## Citation
+
+```bibtex
+@article{li2023otter,
+ title={Otter: A Multi-Modal Model with In-Context Instruction Tuning},
+ author={Li, Bo and Zhang, Yuanhan and Chen, Liangyu and Wang, Jinghao and Yang, Jingkang and Liu, Ziwei},
+ journal={arXiv preprint arXiv:2305.03726},
+ year={2023}
+}
+
+@article{li2023mimicit,
+ title={MIMIC-IT: Multi-Modal In-Context Instruction Tuning},
+ author={Bo Li and Yuanhan Zhang and Liangyu Chen and Jinghao Wang and Fanyi Pu and Jingkang Yang and Chunyuan Li and Ziwei Liu},
+ year={2023},
+ eprint={2306.05425},
+ archivePrefix={arXiv},
+ primaryClass={cs.CV}
+}
+```
diff --git a/configs/otter/metafile.yml b/configs/otter/metafile.yml
new file mode 100644
index 0000000000000000000000000000000000000000..6ee89c62a4d073b5eada03e8f9fbb3508041b8d5
--- /dev/null
+++ b/configs/otter/metafile.yml
@@ -0,0 +1,43 @@
+Collections:
+ - Name: Otter
+ Metadata:
+ Architecture:
+ - Transformer
+ - Gated Cross-Attention Dense
+ Paper:
+ Title: 'Otter: A Multi-Modal Model with In-Context Instruction Tuning'
+ URL: https://arxiv.org/abs/2305.03726
+ README: configs/otter/README.md
+
+Models:
+ - Name: otter-9b_3rdparty_caption
+ Metadata:
+ FLOPs: null
+ Parameters: 8220452880
+ In Collection: Otter
+ Results:
+ - Task: Image Caption
+ Dataset: COCO
+ Metrics:
+ BLEU-4: null
+ CIDER: null
+ Weights: https://download.openmmlab.com/mmclassification/v1/otter/otter-9b-adapter_20230613-51c5be8d.pth
+ Config: configs/otter/otter-9b_caption.py
+ Converted From:
+ Weights: https://huggingface.co/luodian/otter-9b-hf
+ Code: https://github.com/Luodian/Otter/tree/main
+ - Name: otter-9b_3rdparty_vqa
+ Metadata:
+ FLOPs: null
+ Parameters: 8220452880
+ In Collection: Otter
+ Results:
+ - Task: Visual Question Answering
+ Dataset: VQAv2
+ Metrics:
+ Accuracy: null
+ Weights: https://download.openmmlab.com/mmclassification/v1/otter/otter-9b-adapter_20230613-51c5be8d.pth
+ Config: configs/otter/otter-9b_vqa.py
+ Converted From:
+ Weights: https://huggingface.co/luodian/otter-9b-hf
+ Code: https://github.com/Luodian/Otter/tree/main
diff --git a/configs/otter/otter-9b_caption.py b/configs/otter/otter-9b_caption.py
new file mode 100644
index 0000000000000000000000000000000000000000..e35e92ef40cabcccd35f17dd661199b04a76dd6b
--- /dev/null
+++ b/configs/otter/otter-9b_caption.py
@@ -0,0 +1,87 @@
+_base_ = [
+ '../_base_/default_runtime.py',
+]
+
+# model settings
+model = dict(
+ type='Otter',
+ tokenizer=dict(type='LlamaTokenizer', name_or_path='huggyllama/llama-7b'),
+ vision_encoder=dict(
+ type='VisionTransformer',
+ arch='l',
+ patch_size=14,
+ pre_norm=True,
+ norm_cfg=dict(type='LN', eps=1e-5),
+ layer_cfgs=dict(act_cfg=dict(type='mmpretrain.QuickGELU')),
+ final_norm=False,
+ out_type='raw',
+ pretrained=(
+ 'https://download.openmmlab.com/mmclassification/v0/clip/'
+ 'vit-large-p14_clip-openai-pre_3rdparty_20230517-95e2af0b.pth'),
+ ),
+ lang_encoder=dict(
+ base=dict(
+ type='AutoModelForCausalLM',
+ name_or_path='huggyllama/llama-7b',
+ local_files_only=True),
+ adapter=dict(
+ type='FlamingoLMAdapter',
+ vis_hidden_size=1024,
+ cross_attn_every_n_layers=4,
+ use_media_placement_augmentation=False,
+ only_attend_previous=True,
+ ),
+ ),
+ task='caption',
+ final_prompt_tmpl='User:Please describe the image. GPT:',
+ generation_cfg=dict(
+ num_beams=3, max_new_tokens=24, no_repeat_ngram_size=3),
+)
+
+# data settings
+data_preprocessor = dict(
+ type='MultiModalDataPreprocessor',
+ mean=[122.770938, 116.7460125, 104.09373615],
+ std=[68.5005327, 66.6321579, 70.32316305],
+ to_rgb=True,
+)
+
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='ResizeEdge',
+ scale=224,
+ interpolation='bicubic',
+ backend='pillow'),
+ dict(type='CenterCrop', crop_size=(224, 224)),
+ dict(
+ type='PackInputs',
+ algorithm_keys=['gt_caption'],
+ meta_keys=['image_id'],
+ ),
+]
+
+val_dataloader = dict(
+ batch_size=8,
+ num_workers=8,
+ dataset=dict(
+ type='COCOCaption',
+ data_root='data/coco',
+ ann_file='annotations/coco_karpathy_val.json',
+ pipeline=test_pipeline,
+ ),
+ sampler=dict(type='DefaultSampler', shuffle=False),
+ persistent_workers=True,
+)
+
+val_evaluator = dict(
+ type='COCOCaption',
+ ann_file='data/coco/annotations/coco_karpathy_val_gt.json')
+
+# If you want the standard test, please manually configure the test dataset
+test_dataloader = val_dataloader
+test_evaluator = val_evaluator
+
+# schedule settings
+val_cfg = dict()
+test_cfg = dict()
diff --git a/configs/otter/otter-9b_vqa.py b/configs/otter/otter-9b_vqa.py
new file mode 100644
index 0000000000000000000000000000000000000000..72f2b64281126cbf71a81929b12318b0a00f9e36
--- /dev/null
+++ b/configs/otter/otter-9b_vqa.py
@@ -0,0 +1,104 @@
+_base_ = [
+ '../_base_/default_runtime.py',
+]
+
+# model settings
+model = dict(
+ type='Otter',
+ tokenizer=dict(type='LlamaTokenizer', name_or_path='huggyllama/llama-7b'),
+ vision_encoder=dict(
+ type='VisionTransformer',
+ arch='l',
+ patch_size=14,
+ pre_norm=True,
+ norm_cfg=dict(type='LN', eps=1e-5),
+ layer_cfgs=dict(act_cfg=dict(type='QuickGELU')),
+ final_norm=False,
+ out_type='raw',
+ pretrained=(
+ 'https://download.openmmlab.com/mmclassification/v0/clip/'
+ 'vit-large-p14_clip-openai-pre_3rdparty_20230517-95e2af0b.pth'),
+ ),
+ lang_encoder=dict(
+ base=dict(
+ type='AutoModelForCausalLM',
+ name_or_path='huggyllama/llama-7b',
+ local_files_only=True),
+ adapter=dict(
+ type='FlamingoLMAdapter',
+ vis_hidden_size=1024,
+ cross_attn_every_n_layers=4,
+ use_media_placement_augmentation=False,
+ only_attend_previous=True,
+ ),
+ ),
+ task='vqa',
+ final_prompt_tmpl='User:{question} GPT:',
+ generation_cfg=dict(
+ num_beams=3, max_new_tokens=24, no_repeat_ngram_size=3),
+)
+
+# data settings
+data_preprocessor = dict(
+ type='MultiModalDataPreprocessor',
+ mean=[122.770938, 116.7460125, 104.09373615],
+ std=[68.5005327, 66.6321579, 70.32316305],
+ to_rgb=True,
+)
+
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='ResizeEdge',
+ scale=224,
+ interpolation='bicubic',
+ backend='pillow'),
+ dict(type='CenterCrop', crop_size=(224, 224)),
+ dict(
+ type='PackInputs',
+ algorithm_keys=['question', 'gt_answer', 'gt_answer_weight', 'shots'],
+ meta_keys=['image_id'],
+ ),
+]
+
+val_dataloader = dict(
+ batch_size=8,
+ num_workers=8,
+ dataset=dict(
+ type='FlamingoEvalCOCOVQA',
+ data_root='data/coco',
+ data_prefix='val2014',
+ question_file='annotations/v2_OpenEnded_mscoco_val2014_questions.json',
+ ann_file='annotations/v2_mscoco_val2014_annotations.json',
+ pipeline=test_pipeline,
+ num_shots=0,
+ num_support_examples=2048,
+ num_query_examples=5000,
+ ),
+ sampler=dict(type='DefaultSampler', shuffle=False),
+ persistent_workers=True,
+)
+val_evaluator = dict(type='VQAAcc')
+
+test_dataloader = dict(
+ batch_size=8,
+ num_workers=8,
+ dataset=dict(
+ type='FlamingoEvalCOCOVQA',
+ data_root='data/coco',
+ data_prefix='test2015',
+ question_file=
+ 'annotations/v2_OpenEnded_mscoco_test-dev2015_questions.json',
+ pipeline=test_pipeline,
+ num_shots=0,
+ num_support_examples=2048,
+ num_query_examples=5000,
+ ),
+ sampler=dict(type='DefaultSampler', shuffle=False),
+ persistent_workers=True,
+)
+test_evaluator = dict(type='ReportVQA', file_path='vqa_test-dev.json')
+
+# schedule settings
+val_cfg = dict()
+test_cfg = dict()
diff --git a/configs/poolformer/README.md b/configs/poolformer/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..2c4b249329ea03662f768aa350a08fb8eebc763b
--- /dev/null
+++ b/configs/poolformer/README.md
@@ -0,0 +1,80 @@
+# PoolFormer
+
+> [MetaFormer is Actually What You Need for Vision](https://arxiv.org/abs/2111.11418)
+
+
+
+## Abstract
+
+Transformers have shown great potential in computer vision tasks. A common belief is their attention-based token mixer module contributes most to their competence. However, recent works show the attention-based module in transformers can be replaced by spatial MLPs and the resulted models still perform quite well. Based on this observation, we hypothesize that the general architecture of the transformers, instead of the specific token mixer module, is more essential to the model's performance. To verify this, we deliberately replace the attention module in transformers with an embarrassingly simple spatial pooling operator to conduct only basic token mixing. Surprisingly, we observe that the derived model, termed as PoolFormer, achieves competitive performance on multiple computer vision tasks. For example, on ImageNet-1K, PoolFormer achieves 82.1% top-1 accuracy, surpassing well-tuned vision transformer/MLP-like baselines DeiT-B/ResMLP-B24 by 0.3%/1.1% accuracy with 35%/52% fewer parameters and 49%/61% fewer MACs. The effectiveness of PoolFormer verifies our hypothesis and urges us to initiate the concept of "MetaFormer", a general architecture abstracted from transformers without specifying the token mixer. Based on the extensive experiments, we argue that MetaFormer is the key player in achieving superior results for recent transformer and MLP-like models on vision tasks. This work calls for more future research dedicated to improving MetaFormer instead of focusing on the token mixer modules. Additionally, our proposed PoolFormer could serve as a starting baseline for future MetaFormer architecture design.
+
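+Since the token mixer is the central claim of the paper, a minimal PyTorch sketch of it may help. This is only an illustration of the idea (spatial average pooling with the identity subtracted), not the MMPreTrain implementation.
+
+```python
+import torch
+import torch.nn as nn
+
+
+class PoolingTokenMixer(nn.Module):
+    """Sketch of PoolFormer's token mixer: average pooling minus identity."""
+
+    def __init__(self, pool_size: int = 3):
+        super().__init__()
+        self.pool = nn.AvgPool2d(
+            pool_size, stride=1, padding=pool_size // 2, count_include_pad=False)
+
+    def forward(self, x: torch.Tensor) -> torch.Tensor:
+        # x is an (N, C, H, W) feature map. The block's residual branch adds x
+        # back, so subtracting it here keeps only the neighbourhood information.
+        return self.pool(x) - x
+
+
+print(PoolingTokenMixer()(torch.rand(1, 64, 56, 56)).shape)
+```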
+
+
+
+
+
+## How to use it?
+
+
+
+**Predict image**
+
+```python
+from mmpretrain import inference_model
+
+predict = inference_model('poolformer-s12_3rdparty_32xb128_in1k', 'demo/bird.JPEG')
+print(predict['pred_class'])
+print(predict['pred_score'])
+```
+
+**Use the model**
+
+```python
+import torch
+from mmpretrain import get_model
+
+model = get_model('poolformer-s12_3rdparty_32xb128_in1k', pretrained=True)
+inputs = torch.rand(1, 3, 224, 224)
+out = model(inputs)
+print(type(out))
+# To extract features.
+feats = model.extract_feat(inputs)
+print(type(feats))
+```
+
+**Test Command**
+
+Prepare your dataset according to the [docs](https://mmpretrain.readthedocs.io/en/latest/user_guides/dataset_prepare.html#prepare-dataset).
+
+Test:
+
+```shell
+python tools/test.py configs/poolformer/poolformer-s12_32xb128_in1k.py https://download.openmmlab.com/mmclassification/v0/poolformer/poolformer-s12_3rdparty_32xb128_in1k_20220414-f8d83051.pth
+```
+
+
+
+## Models and results
+
+### Image Classification on ImageNet-1k
+
+| Model | Pretrain | Params (M) | Flops (G) | Top-1 (%) | Top-5 (%) | Config | Download |
+| :--------------------------------------- | :----------: | :--------: | :-------: | :-------: | :-------: | :--------------------------------------: | :---------------------------------------------------------------------: |
+| `poolformer-s12_3rdparty_32xb128_in1k`\* | From scratch | 11.92 | 1.87 | 77.24 | 93.51 | [config](poolformer-s12_32xb128_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/poolformer/poolformer-s12_3rdparty_32xb128_in1k_20220414-f8d83051.pth) |
+| `poolformer-s24_3rdparty_32xb128_in1k`\* | From scratch | 21.39 | 3.51 | 80.33 | 95.05 | [config](poolformer-s24_32xb128_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/poolformer/poolformer-s24_3rdparty_32xb128_in1k_20220414-d7055904.pth) |
+| `poolformer-s36_3rdparty_32xb128_in1k`\* | From scratch | 30.86 | 5.15 | 81.43 | 95.45 | [config](poolformer-s36_32xb128_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/poolformer/poolformer-s36_3rdparty_32xb128_in1k_20220414-d78ff3e8.pth) |
+| `poolformer-m36_3rdparty_32xb128_in1k`\* | From scratch | 56.17 | 8.96 | 82.14 | 95.71 | [config](poolformer-m36_32xb128_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/poolformer/poolformer-m36_3rdparty_32xb128_in1k_20220414-c55e0949.pth) |
+| `poolformer-m48_3rdparty_32xb128_in1k`\* | From scratch | 73.47 | 11.80 | 82.51 | 95.95 | [config](poolformer-m48_32xb128_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/poolformer/poolformer-m48_3rdparty_32xb128_in1k_20220414-9378f3eb.pth) |
+
+*Models with * are converted from the [official repo](https://github.com/sail-sg/poolformer). The config files of these models are only for inference. We haven't reproduced the training results.*
+
+## Citation
+
+```bibtex
+@inproceedings{yu2022metaformer,
+ title={Metaformer is actually what you need for vision},
+ author={Yu, Weihao and Luo, Mi and Zhou, Pan and Si, Chenyang and Zhou, Yichen and Wang, Xinchao and Feng, Jiashi and Yan, Shuicheng},
+ booktitle={Proceedings of the IEEE/CVF conference on computer vision and pattern recognition},
+ pages={10819--10829},
+ year={2022}
+}
+```
diff --git a/configs/poolformer/metafile.yml b/configs/poolformer/metafile.yml
new file mode 100644
index 0000000000000000000000000000000000000000..55285ddd0230270030f25bef09b1461dc7278dc3
--- /dev/null
+++ b/configs/poolformer/metafile.yml
@@ -0,0 +1,99 @@
+Collections:
+ - Name: PoolFormer
+ Metadata:
+ Training Data: ImageNet-1k
+ Architecture:
+ - Pooling
+ - 1x1 Convolution
+ - LayerScale
+ Paper:
+ URL: https://arxiv.org/abs/2111.11418
+ Title: MetaFormer is Actually What You Need for Vision
+ README: configs/poolformer/README.md
+ Code:
+ Version: v0.22.1
+ URL: https://github.com/open-mmlab/mmpretrain/blob/v0.22.1/mmcls/models/backbones/poolformer.py
+
+Models:
+ - Name: poolformer-s12_3rdparty_32xb128_in1k
+ Metadata:
+ FLOPs: 1871399424
+ Parameters: 11915176
+ In Collection: PoolFormer
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 77.24
+ Top 5 Accuracy: 93.51
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/poolformer/poolformer-s12_3rdparty_32xb128_in1k_20220414-f8d83051.pth
+ Config: configs/poolformer/poolformer-s12_32xb128_in1k.py
+ Converted From:
+ Weights: https://github.com/sail-sg/poolformer/releases/download/v1.0/poolformer_s12.pth.tar
+ Code: https://github.com/sail-sg/poolformer
+ - Name: poolformer-s24_3rdparty_32xb128_in1k
+ Metadata:
+ Training Data: ImageNet-1k
+ FLOPs: 3510411008
+ Parameters: 21388968
+ In Collection: PoolFormer
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 80.33
+ Top 5 Accuracy: 95.05
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/poolformer/poolformer-s24_3rdparty_32xb128_in1k_20220414-d7055904.pth
+ Config: configs/poolformer/poolformer-s24_32xb128_in1k.py
+ Converted From:
+ Weights: https://github.com/sail-sg/poolformer/releases/download/v1.0/poolformer_s24.pth.tar
+ Code: https://github.com/sail-sg/poolformer
+ - Name: poolformer-s36_3rdparty_32xb128_in1k
+ Metadata:
+ FLOPs: 5149422592
+ Parameters: 30862760
+ In Collection: PoolFormer
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 81.43
+ Top 5 Accuracy: 95.45
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/poolformer/poolformer-s36_3rdparty_32xb128_in1k_20220414-d78ff3e8.pth
+ Config: configs/poolformer/poolformer-s36_32xb128_in1k.py
+ Converted From:
+ Weights: https://github.com/sail-sg/poolformer/releases/download/v1.0/poolformer_s36.pth.tar
+ Code: https://github.com/sail-sg/poolformer
+ - Name: poolformer-m36_3rdparty_32xb128_in1k
+ Metadata:
+ Training Data: ImageNet-1k
+ FLOPs: 8960175744
+ Parameters: 56172520
+ In Collection: PoolFormer
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 82.14
+ Top 5 Accuracy: 95.71
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/poolformer/poolformer-m36_3rdparty_32xb128_in1k_20220414-c55e0949.pth
+ Config: configs/poolformer/poolformer-m36_32xb128_in1k.py
+ Converted From:
+ Weights: https://github.com/sail-sg/poolformer/releases/download/v1.0/poolformer_m36.pth.tar
+ Code: https://github.com/sail-sg/poolformer
+ - Name: poolformer-m48_3rdparty_32xb128_in1k
+ Metadata:
+ FLOPs: 11801805696
+ Parameters: 73473448
+ In Collection: PoolFormer
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 82.51
+ Top 5 Accuracy: 95.95
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/poolformer/poolformer-m48_3rdparty_32xb128_in1k_20220414-9378f3eb.pth
+ Config: configs/poolformer/poolformer-m48_32xb128_in1k.py
+ Converted From:
+ Weights: https://github.com/sail-sg/poolformer/releases/download/v1.0/poolformer_m48.pth.tar
+ Code: https://github.com/sail-sg/poolformer
diff --git a/configs/poolformer/poolformer-m36_32xb128_in1k.py b/configs/poolformer/poolformer-m36_32xb128_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..13065b8cf5100b4d16696d54cfa8c0a727541831
--- /dev/null
+++ b/configs/poolformer/poolformer-m36_32xb128_in1k.py
@@ -0,0 +1,17 @@
+_base_ = [
+ '../_base_/models/poolformer/poolformer_m36.py',
+ '../_base_/datasets/imagenet_bs128_poolformer_medium_224.py',
+ '../_base_/schedules/imagenet_bs1024_adamw_swin.py',
+ '../_base_/default_runtime.py',
+]
+
+# schedule settings
+optim_wrapper = dict(
+ optimizer=dict(lr=4e-3),
+ clip_grad=dict(max_norm=5.0),
+)
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR
+# based on the actual training batch size.
+# base_batch_size = (32 GPUs) x (128 samples per GPU)
+auto_scale_lr = dict(base_batch_size=4096)
diff --git a/configs/poolformer/poolformer-m48_32xb128_in1k.py b/configs/poolformer/poolformer-m48_32xb128_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..2078df39c4a16783b8f1a7ffc5c5da2b346eb1f0
--- /dev/null
+++ b/configs/poolformer/poolformer-m48_32xb128_in1k.py
@@ -0,0 +1,17 @@
+_base_ = [
+ '../_base_/models/poolformer/poolformer_m48.py',
+ '../_base_/datasets/imagenet_bs128_poolformer_medium_224.py',
+ '../_base_/schedules/imagenet_bs1024_adamw_swin.py',
+ '../_base_/default_runtime.py',
+]
+
+# schedule settings
+optim_wrapper = dict(
+ optimizer=dict(lr=4e-3),
+ clip_grad=dict(max_norm=5.0),
+)
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR
+# based on the actual training batch size.
+# base_batch_size = (32 GPUs) x (128 samples per GPU)
+auto_scale_lr = dict(base_batch_size=4096)
diff --git a/configs/poolformer/poolformer-s12_32xb128_in1k.py b/configs/poolformer/poolformer-s12_32xb128_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..7cf4a6365604def73f2ea293b857ebdc8b2ed9b3
--- /dev/null
+++ b/configs/poolformer/poolformer-s12_32xb128_in1k.py
@@ -0,0 +1,17 @@
+_base_ = [
+ '../_base_/models/poolformer/poolformer_s12.py',
+ '../_base_/datasets/imagenet_bs128_poolformer_small_224.py',
+ '../_base_/schedules/imagenet_bs1024_adamw_swin.py',
+ '../_base_/default_runtime.py',
+]
+
+# schedule settings
+optim_wrapper = dict(
+ optimizer=dict(lr=4e-3),
+ clip_grad=dict(max_norm=5.0),
+)
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR
+# based on the actual training batch size.
+# base_batch_size = (32 GPUs) x (128 samples per GPU)
+auto_scale_lr = dict(base_batch_size=4096)
diff --git a/configs/poolformer/poolformer-s24_32xb128_in1k.py b/configs/poolformer/poolformer-s24_32xb128_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..ffb2482d16c3e432c1f3d0a233a69a76b99efdd8
--- /dev/null
+++ b/configs/poolformer/poolformer-s24_32xb128_in1k.py
@@ -0,0 +1,17 @@
+_base_ = [
+ '../_base_/models/poolformer/poolformer_s24.py',
+ '../_base_/datasets/imagenet_bs128_poolformer_small_224.py',
+ '../_base_/schedules/imagenet_bs1024_adamw_swin.py',
+ '../_base_/default_runtime.py',
+]
+
+# schedule settings
+optim_wrapper = dict(
+ optimizer=dict(lr=4e-3),
+ clip_grad=dict(max_norm=5.0),
+)
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR
+# based on the actual training batch size.
+# base_batch_size = (32 GPUs) x (128 samples per GPU)
+auto_scale_lr = dict(base_batch_size=4096)
diff --git a/configs/poolformer/poolformer-s36_32xb128_in1k.py b/configs/poolformer/poolformer-s36_32xb128_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..842dab3ac51645046d15f04b8bc1ace42781144b
--- /dev/null
+++ b/configs/poolformer/poolformer-s36_32xb128_in1k.py
@@ -0,0 +1,17 @@
+_base_ = [
+ '../_base_/models/poolformer/poolformer_s36.py',
+ '../_base_/datasets/imagenet_bs128_poolformer_small_224.py',
+ '../_base_/schedules/imagenet_bs1024_adamw_swin.py',
+ '../_base_/default_runtime.py',
+]
+
+# schedule settings
+optim_wrapper = dict(
+ optimizer=dict(lr=4e-3),
+ clip_grad=dict(max_norm=5.0),
+)
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR
+# based on the actual training batch size.
+# base_batch_size = (32 GPUs) x (128 samples per GPU)
+auto_scale_lr = dict(base_batch_size=4096)
diff --git a/configs/regnet/README.md b/configs/regnet/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..63031f4e89b934d823ce53f08cdbad597729fd7e
--- /dev/null
+++ b/configs/regnet/README.md
@@ -0,0 +1,88 @@
+# RegNet
+
+> [Designing Network Design Spaces](https://arxiv.org/abs/2003.13678)
+
+
+
+## Abstract
+
+In this work, we present a new network design paradigm. Our goal is to help advance the understanding of network design and discover design principles that generalize across settings. Instead of focusing on designing individual network instances, we design network design spaces that parametrize populations of networks. The overall process is analogous to classic manual design of networks, but elevated to the design space level. Using our methodology we explore the structure aspect of network design and arrive at a low-dimensional design space consisting of simple, regular networks that we call RegNet. The core insight of the RegNet parametrization is surprisingly simple: widths and depths of good networks can be explained by a quantized linear function. We analyze the RegNet design space and arrive at interesting findings that do not match the current practice of network design. The RegNet design space provides simple and fast networks that work well across a wide range of flop regimes. Under comparable training settings and flops, the RegNet models outperform the popular EfficientNet models while being up to 5x faster on GPUs.
+
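+The quantized linear rule mentioned in the abstract is simple enough to sketch in a few lines of NumPy. This is an illustration, not the MMPreTrain implementation; the `regnetx_800mf` parameters below are an assumption based on the backbone's arch settings, and the resulting last stage width should line up with the head `in_channels` used in `regnetx-800mf_8xb128_in1k.py`.
+
+```python
+import numpy as np
+
+
+def regnet_widths(w0, wa, wm, depth, group_w, q=8):
+    """Per-block widths from the quantized linear rule (illustration only)."""
+    # Continuous widths: u_j = w0 + wa * j for block index j.
+    u = w0 + wa * np.arange(depth)
+    # Quantize onto the geometric progression w0 * wm ** s.
+    s = np.round(np.log(u / w0) / np.log(wm))
+    widths = w0 * np.power(wm, s)
+    # Round to a multiple of q, then to a multiple of the group width so that
+    # grouped convolutions divide the channels evenly.
+    widths = np.round(widths / q) * q
+    return (np.round(widths / group_w) * group_w).astype(int)
+
+
+# Parameters roughly matching the regnetx_800mf arch setting (an assumption).
+widths = regnet_widths(w0=56, wa=35.73, wm=2.28, depth=16, group_w=16)
+print(sorted(set(widths.tolist())))  # stage widths; the last one (672) matches
+                                     # the head `in_channels` of the 800mf config
+```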
+
+
+
+
+
+## How to use it?
+
+
+
+**Predict image**
+
+```python
+from mmpretrain import inference_model
+
+predict = inference_model('regnetx-400mf_8xb128_in1k', 'demo/bird.JPEG')
+print(predict['pred_class'])
+print(predict['pred_score'])
+```
+
+**Use the model**
+
+```python
+import torch
+from mmpretrain import get_model
+
+model = get_model('regnetx-400mf_8xb128_in1k', pretrained=True)
+inputs = torch.rand(1, 3, 224, 224)
+out = model(inputs)
+print(type(out))
+# To extract features.
+feats = model.extract_feat(inputs)
+print(type(feats))
+```
+
+**Train/Test Command**
+
+Prepare your dataset according to the [docs](https://mmpretrain.readthedocs.io/en/latest/user_guides/dataset_prepare.html#prepare-dataset).
+
+Train:
+
+```shell
+python tools/train.py configs/regnet/regnetx-400mf_8xb128_in1k.py
+```
+
+Test:
+
+```shell
+python tools/test.py configs/regnet/regnetx-400mf_8xb128_in1k.py https://download.openmmlab.com/mmclassification/v0/regnet/regnetx-400mf_8xb128_in1k_20211213-89bfc226.pth
+```
+
+
+
+## Models and results
+
+### Image Classification on ImageNet-1k
+
+| Model | Pretrain | Params (M) | Flops (G) | Top-1 (%) | Top-5 (%) | Config | Download |
+| :-------------------------- | :----------: | :--------: | :-------: | :-------: | :-------: | :------------------------------------: | :------------------------------------------------------------------------------------: |
+| `regnetx-400mf_8xb128_in1k` | From scratch | 5.16 | 0.41 | 72.56 | 90.78 | [config](regnetx-400mf_8xb128_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/regnet/regnetx-400mf_8xb128_in1k_20211213-89bfc226.pth) \| [log](https://download.openmmlab.com/mmclassification/v0/regnet/regnetx-400mf_8xb128_in1k_20211208_143316.json) |
+| `regnetx-800mf_8xb128_in1k` | From scratch | 7.26 | 0.81 | 74.76 | 92.32 | [config](regnetx-800mf_8xb128_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/regnet/regnetx-800mf_8xb128_in1k_20211213-222b0f11.pth) \| [log](https://download.openmmlab.com/mmclassification/v0/regnet/regnetx-800mf_8xb128_in1k_20211207_143037.log.json) |
+| `regnetx-1.6gf_8xb128_in1k` | From scratch | 9.19 | 1.63 | 76.84 | 93.31 | [config](regnetx-1.6gf_8xb128_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/regnet/regnetx-1.6gf_8xb128_in1k_20211213-d1b89758.pth) \| [log](https://download.openmmlab.com/mmclassification/v0/regnet/regnetx-1.6gf_8xb128_in1k_20211208_143018.log.json) |
+| `regnetx-3.2gf_8xb64_in1k` | From scratch | 3.21 | 1.53 | 78.09 | 94.08 | [config](regnetx-3.2gf_8xb64_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/regnet/regnetx-3.2gf_8xb64_in1k_20211213-1fdd82ae.pth) \| [log](https://download.openmmlab.com/mmclassification/v0/regnet/regnetx-3.2gf_8xb64_in1k_20211208_142720.log.json) |
+| `regnetx-4.0gf_8xb64_in1k` | From scratch | 22.12 | 4.00 | 78.60 | 94.17 | [config](regnetx-4.0gf_8xb64_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/regnet/regnetx-4.0gf_8xb64_in1k_20211213-efed675c.pth) \| [log](https://download.openmmlab.com/mmclassification/v0/regnet/regnetx-4.0gf_8xb64_in1k_20211207_150431.log.json) |
+| `regnetx-6.4gf_8xb64_in1k` | From scratch | 26.21 | 6.51 | 79.38 | 94.65 | [config](regnetx-6.4gf_8xb64_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/regnet/regnetx-6.4gf_8xb64_in1k_20211215-5c6089da.pth) \| [log](https://download.openmmlab.com/mmclassification/v0/regnet/regnetx-6.4gf_8xb64_in1k_20211213_172748.log.json) |
+| `regnetx-8.0gf_8xb64_in1k` | From scratch | 39.57 | 8.03 | 79.12 | 94.51 | [config](regnetx-8.0gf_8xb64_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/regnet/regnetx-8.0gf_8xb64_in1k_20211213-9a9fcc76.pth) \| [log](https://download.openmmlab.com/mmclassification/v0/regnet/regnetx-8.0gf_8xb64_in1k_20211208_103250.log.json) |
+| `regnetx-12gf_8xb64_in1k` | From scratch | 46.11 | 12.15 | 79.67 | 95.03 | [config](regnetx-12gf_8xb64_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/regnet/regnetx-12gf_8xb64_in1k_20211213-5df8c2f8.pth) \| [log](https://download.openmmlab.com/mmclassification/v0/regnet/regnetx-12gf_8xb64_in1k_20211208_143713.log.json) |
+
+## Citation
+
+```bibtex
+@article{radosavovic2020designing,
+ title={Designing Network Design Spaces},
+ author={Ilija Radosavovic and Raj Prateek Kosaraju and Ross Girshick and Kaiming He and Piotr Dollár},
+ year={2020},
+ eprint={2003.13678},
+ archivePrefix={arXiv},
+ primaryClass={cs.CV}
+}
+```
diff --git a/configs/regnet/metafile.yml b/configs/regnet/metafile.yml
new file mode 100644
index 0000000000000000000000000000000000000000..4796a9f42a19092e956b3511467b84b26e372b99
--- /dev/null
+++ b/configs/regnet/metafile.yml
@@ -0,0 +1,122 @@
+Collections:
+ - Name: RegNet
+ Metadata:
+ Training Data: ImageNet-1k
+ Architecture:
+ - Neural Architecture Search
+ - Design Space Design
+ - Precise BN
+ - SGD with nesterov
+ Paper:
+ URL: https://arxiv.org/abs/2003.13678
+ Title: Designing Network Design Spaces
+ README: configs/regnet/README.md
+ Code:
+ URL: https://github.com/open-mmlab/mmpretrain/blob/v0.18.0/mmcls/models/backbones/regnet.py
+ Version: v0.18.0
+
+Models:
+ - Name: regnetx-400mf_8xb128_in1k
+ In Collection: RegNet
+ Config: configs/regnet/regnetx-400mf_8xb128_in1k.py
+ Metadata:
+ FLOPs: 410000000 # 0.41G
+ Parameters: 5160000 # 5.16M
+ Results:
+ - Dataset: ImageNet-1k
+ Task: Image Classification
+ Metrics:
+ Top 1 Accuracy: 72.56
+ Top 5 Accuracy: 90.78
+ Weights: https://download.openmmlab.com/mmclassification/v0/regnet/regnetx-400mf_8xb128_in1k_20211213-89bfc226.pth
+ - Name: regnetx-800mf_8xb128_in1k
+ In Collection: RegNet
+ Config: configs/regnet/regnetx-800mf_8xb128_in1k.py
+ Metadata:
+ FLOPs: 810000000 # 0.81G
+ Parameters: 7260000 # 7.26M
+ Results:
+ - Dataset: ImageNet-1k
+ Task: Image Classification
+ Metrics:
+ Top 1 Accuracy: 74.76
+ Top 5 Accuracy: 92.32
+ Weights: https://download.openmmlab.com/mmclassification/v0/regnet/regnetx-800mf_8xb128_in1k_20211213-222b0f11.pth
+ - Name: regnetx-1.6gf_8xb128_in1k
+ In Collection: RegNet
+ Config: configs/regnet/regnetx-1.6gf_8xb128_in1k.py
+ Metadata:
+ FLOPs: 1630000000 # 1.63G
+ Parameters: 9190000 # 9.19M
+ Results:
+ - Dataset: ImageNet-1k
+ Task: Image Classification
+ Metrics:
+ Top 1 Accuracy: 76.84
+ Top 5 Accuracy: 93.31
+ Weights: https://download.openmmlab.com/mmclassification/v0/regnet/regnetx-1.6gf_8xb128_in1k_20211213-d1b89758.pth
+ - Name: regnetx-3.2gf_8xb64_in1k
+ In Collection: RegNet
+ Config: configs/regnet/regnetx-3.2gf_8xb64_in1k.py
+ Metadata:
+ FLOPs: 1530000000 # 1.53G
+ Parameters: 3210000 # 32.1M
+ Results:
+ - Dataset: ImageNet-1k
+ Task: Image Classification
+ Metrics:
+ Top 1 Accuracy: 78.09
+ Top 5 Accuracy: 94.08
+ Weights: https://download.openmmlab.com/mmclassification/v0/regnet/regnetx-3.2gf_8xb64_in1k_20211213-1fdd82ae.pth
+ - Name: regnetx-4.0gf_8xb64_in1k
+ In Collection: RegNet
+ Config: configs/regnet/regnetx-4.0gf_8xb64_in1k.py
+ Metadata:
+ FLOPs: 4000000000 # 4G
+ Parameters: 22120000 # 22.12M
+ Results:
+ - Dataset: ImageNet-1k
+ Task: Image Classification
+ Metrics:
+ Top 1 Accuracy: 78.60
+ Top 5 Accuracy: 94.17
+ Weights: https://download.openmmlab.com/mmclassification/v0/regnet/regnetx-4.0gf_8xb64_in1k_20211213-efed675c.pth
+ - Name: regnetx-6.4gf_8xb64_in1k
+ In Collection: RegNet
+ Config: configs/regnet/regnetx-6.4gf_8xb64_in1k.py
+ Metadata:
+ FLOPs: 6510000000 # 6.51G
+ Parameters: 26210000 # 26.21M
+ Results:
+ - Dataset: ImageNet-1k
+ Task: Image Classification
+ Metrics:
+ Top 1 Accuracy: 79.38
+ Top 5 Accuracy: 94.65
+ Weights: https://download.openmmlab.com/mmclassification/v0/regnet/regnetx-6.4gf_8xb64_in1k_20211215-5c6089da.pth
+ - Name: regnetx-8.0gf_8xb64_in1k
+ In Collection: RegNet
+ Config: configs/regnet/regnetx-8.0gf_8xb64_in1k.py
+ Metadata:
+ FLOPs: 8030000000 # 8.03G
+ Parameters: 39570000 # 39.57M
+ Results:
+ - Dataset: ImageNet-1k
+ Task: Image Classification
+ Metrics:
+ Top 1 Accuracy: 79.12
+ Top 5 Accuracy: 94.51
+ Weights: https://download.openmmlab.com/mmclassification/v0/regnet/regnetx-8.0gf_8xb64_in1k_20211213-9a9fcc76.pth
+ - Name: regnetx-12gf_8xb64_in1k
+ In Collection: RegNet
+ Config: configs/regnet/regnetx-12gf_8xb64_in1k.py
+ Metadata:
+ FLOPs: 12150000000 # 12.15G
+ Parameters: 46110000 # 46.11M
+ Results:
+ - Dataset: ImageNet-1k
+ Task: Image Classification
+ Metrics:
+ Top 1 Accuracy: 79.67
+ Top 5 Accuracy: 95.03
+ Weights: https://download.openmmlab.com/mmclassification/v0/regnet/regnetx-12gf_8xb64_in1k_20211213-5df8c2f8.pth
diff --git a/configs/regnet/regnetx-1.6gf_8xb128_in1k.py b/configs/regnet/regnetx-1.6gf_8xb128_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..d3e9e934fede12e5c06673dc12898db35654cf2a
--- /dev/null
+++ b/configs/regnet/regnetx-1.6gf_8xb128_in1k.py
@@ -0,0 +1,6 @@
+_base_ = ['./regnetx-400mf_8xb128_in1k.py']
+
+# model settings
+model = dict(
+ backbone=dict(type='RegNet', arch='regnetx_1.6gf'),
+ head=dict(in_channels=912, ))
diff --git a/configs/regnet/regnetx-12gf_8xb64_in1k.py b/configs/regnet/regnetx-12gf_8xb64_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..7a2c0b5aa15ec760c461bf46d6ff9537c68f0fa4
--- /dev/null
+++ b/configs/regnet/regnetx-12gf_8xb64_in1k.py
@@ -0,0 +1,18 @@
+_base_ = ['./regnetx-400mf_8xb128_in1k.py']
+
+# model settings
+model = dict(
+ backbone=dict(type='RegNet', arch='regnetx_12gf'),
+ head=dict(in_channels=2240, ))
+
+# dataset settings
+train_dataloader = dict(batch_size=64)
+
+# schedule settings
+# for batch_size 512, use lr = 0.4
+optim_wrapper = dict(optimizer=dict(lr=0.4))
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR
+# based on the actual training batch size.
+# base_batch_size = (8 GPUs) x (64 samples per GPU)
+auto_scale_lr = dict(base_batch_size=512)
diff --git a/configs/regnet/regnetx-3.2gf_8xb64_in1k.py b/configs/regnet/regnetx-3.2gf_8xb64_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..a78478d6df89eee57960f239069192a7d529682e
--- /dev/null
+++ b/configs/regnet/regnetx-3.2gf_8xb64_in1k.py
@@ -0,0 +1,18 @@
+_base_ = ['./regnetx-400mf_8xb128_in1k.py']
+
+# model settings
+model = dict(
+ backbone=dict(type='RegNet', arch='regnetx_3.2gf'),
+ head=dict(in_channels=1008, ))
+
+# dataset settings
+train_dataloader = dict(batch_size=64)
+
+# schedule settings
+# for batch_size 512, use lr = 0.4
+optim_wrapper = dict(optimizer=dict(lr=0.4))
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR
+# based on the actual training batch size.
+# base_batch_size = (8 GPUs) x (64 samples per GPU)
+auto_scale_lr = dict(base_batch_size=512)
diff --git a/configs/regnet/regnetx-4.0gf_8xb64_in1k.py b/configs/regnet/regnetx-4.0gf_8xb64_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..dfc241fe0c8469ae3b8d522b7da7fb2da49f39de
--- /dev/null
+++ b/configs/regnet/regnetx-4.0gf_8xb64_in1k.py
@@ -0,0 +1,18 @@
+_base_ = ['./regnetx-400mf_8xb128_in1k.py']
+
+# model settings
+model = dict(
+ backbone=dict(type='RegNet', arch='regnetx_4.0gf'),
+ head=dict(in_channels=1360, ))
+
+# dataset settings
+train_dataloader = dict(batch_size=64)
+
+# schedule settings
+# for batch_size 512, use lr = 0.4
+optim_wrapper = dict(optimizer=dict(lr=0.4))
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR
+# based on the actual training batch size.
+# base_batch_size = (8 GPUs) x (64 samples per GPU)
+auto_scale_lr = dict(base_batch_size=512)
diff --git a/configs/regnet/regnetx-400mf_8xb128_in1k.py b/configs/regnet/regnetx-400mf_8xb128_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..bad16785c04ad49db3b125fdcb343aa4c559cdd9
--- /dev/null
+++ b/configs/regnet/regnetx-400mf_8xb128_in1k.py
@@ -0,0 +1,58 @@
+_base_ = [
+ '../_base_/models/regnet/regnetx_400mf.py',
+ '../_base_/datasets/imagenet_bs32.py',
+ '../_base_/schedules/imagenet_bs1024_coslr.py',
+ '../_base_/default_runtime.py'
+]
+
+# dataset settings
+data_preprocessor = dict(
+ # BGR format normalization parameters
+ mean=[103.53, 116.28, 123.675],
+ std=[57.375, 57.12, 58.395],
+ to_rgb=False, # The checkpoints from PyCls requires BGR format inputs.
+)
+
+# Lighting params, in order of BGR, from the pycls repo
+EIGVAL = [0.2175, 0.0188, 0.0045]
+EIGVEC = [
+ [-0.5836, -0.6948, 0.4203],
+ [-0.5808, -0.0045, -0.814],
+ [-0.5675, 0.7192, 0.4009],
+]
+
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='RandomResizedCrop', scale=224),
+ dict(type='RandomFlip', prob=0.5, direction='horizontal'),
+ dict(
+ type='Lighting',
+ eigval=EIGVAL,
+ eigvec=EIGVEC,
+ alphastd=25.5, # because the value range of images is [0,255]
+ to_rgb=False),
+ dict(type='PackInputs'),
+]
+
+train_dataloader = dict(batch_size=128, dataset=dict(pipeline=train_pipeline))
+val_dataloader = dict(batch_size=128)
+test_dataloader = dict(batch_size=128)
+
+# schedule settings
+
+# SGD with Nesterov, base lr is 0.8 for batch_size 1024
+optim_wrapper = dict(optimizer=dict(lr=0.8, nesterov=True))
+
+# runtime settings
+
+# The Precise BN hook updates the BN statistics, so it should be executed
+# before CheckpointHook (priority 'VERY_LOW') and EMAHook (priority 'NORMAL').
+# Therefore, set the priority of PreciseBNHook to 'ABOVE_NORMAL' here.
+custom_hooks = [
+ dict(
+ type='PreciseBNHook',
+ num_samples=8192,
+ interval=1,
+ priority='ABOVE_NORMAL')
+]
diff --git a/configs/regnet/regnetx-6.4gf_8xb64_in1k.py b/configs/regnet/regnetx-6.4gf_8xb64_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..edb1fb8e482cd51f44c377c493f00c3e6d7185ad
--- /dev/null
+++ b/configs/regnet/regnetx-6.4gf_8xb64_in1k.py
@@ -0,0 +1,18 @@
+_base_ = ['./regnetx-400mf_8xb128_in1k.py']
+
+# model settings
+model = dict(
+ backbone=dict(type='RegNet', arch='regnetx_6.4gf'),
+ head=dict(in_channels=1624, ))
+
+# dataset settings
+train_dataloader = dict(batch_size=64)
+
+# schedule settings
+# for batch_size 512, use lr = 0.4
+optim_wrapper = dict(optimizer=dict(lr=0.4))
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR
+# based on the actual training batch size.
+# base_batch_size = (8 GPUs) x (64 samples per GPU)
+auto_scale_lr = dict(base_batch_size=512)
diff --git a/configs/regnet/regnetx-8.0gf_8xb64_in1k.py b/configs/regnet/regnetx-8.0gf_8xb64_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..04b75bbe25987a6b10a984f264288e6c90b29719
--- /dev/null
+++ b/configs/regnet/regnetx-8.0gf_8xb64_in1k.py
@@ -0,0 +1,18 @@
+_base_ = ['./regnetx-400mf_8xb128_in1k.py']
+
+# model settings
+model = dict(
+ backbone=dict(type='RegNet', arch='regnetx_8.0gf'),
+ head=dict(in_channels=1920, ))
+
+# dataset settings
+train_dataloader = dict(batch_size=64)
+
+# schedule settings
+# for batch_size 512, use lr = 0.4
+optim_wrapper = dict(optimizer=dict(lr=0.4))
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR
+# based on the actual training batch size.
+# base_batch_size = (8 GPUs) x (64 samples per GPU)
+auto_scale_lr = dict(base_batch_size=512)
diff --git a/configs/regnet/regnetx-800mf_8xb128_in1k.py b/configs/regnet/regnetx-800mf_8xb128_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..9cd71379a108703f5ca3ce7f4f156227085045aa
--- /dev/null
+++ b/configs/regnet/regnetx-800mf_8xb128_in1k.py
@@ -0,0 +1,6 @@
+_base_ = ['./regnetx-400mf_8xb128_in1k.py']
+
+# model settings
+model = dict(
+ backbone=dict(type='RegNet', arch='regnetx_800mf'),
+ head=dict(in_channels=672, ))
diff --git a/configs/replknet/README.md b/configs/replknet/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..3d312f24aa95837c056892cea315458749558206
--- /dev/null
+++ b/configs/replknet/README.md
@@ -0,0 +1,108 @@
+# RepLKNet
+
+> [Scaling Up Your Kernels to 31x31: Revisiting Large Kernel Design in CNNs](https://arxiv.org/abs/2203.06717)
+
+
+
+## Abstract
+
+We revisit large kernel design in modern convolutional neural networks (CNNs). Inspired by recent advances in vision transformers (ViTs), in this paper, we demonstrate that using a few large convolutional kernels instead of a stack of small kernels could be a more powerful paradigm. We suggested five guidelines, e.g., applying re-parameterized large depth-wise convolutions, to design efficient high-performance large-kernel CNNs. Following the guidelines, we propose RepLKNet, a pure CNN architecture whose kernel size is as large as 31×31, in contrast to commonly used 3×3. RepLKNet greatly closes the performance gap between CNNs and ViTs, e.g., achieving comparable or superior results than Swin Transformer on ImageNet and a few typical downstream tasks, with lower latency. RepLKNet also shows nice scalability to big data and large models, obtaining 87.8% top-1 accuracy on ImageNet and 56.0% mIoU on ADE20K, which is very competitive among the state-of-the-arts with similar model sizes. Our study further reveals that, in contrast to small-kernel CNNs, large kernel CNNs have much larger effective receptive fields and higher shape bias rather than texture bias.
+
+
+
+
+
+
+## How to use it?
+
+
+
+**Predict image**
+
+```python
+from mmpretrain import inference_model, get_model
+
+model = get_model('replknet-31B_3rdparty_in1k', pretrained=True)
+model.backbone.switch_to_deploy()
+predict = inference_model(model, 'demo/bird.JPEG')
+print(predict['pred_class'])
+print(predict['pred_score'])
+```
+
+**Use the model**
+
+```python
+import torch
+from mmpretrain import get_model
+
+model = get_model('replknet-31B_3rdparty_in1k', pretrained=True)
+inputs = torch.rand(1, 3, 224, 224)
+out = model(inputs)
+print(type(out))
+# To extract features.
+feats = model.extract_feat(inputs)
+print(type(feats))
+```
+
+**Test Command**
+
+Prepare your dataset according to the [docs](https://mmpretrain.readthedocs.io/en/latest/user_guides/dataset_prepare.html#prepare-dataset).
+
+Test:
+
+```shell
+python tools/test.py configs/replknet/replknet-31B_32xb64_in1k.py https://download.openmmlab.com/mmclassification/v0/replknet/replknet-31B_3rdparty_in1k_20221118-fd08e268.pth
+```
+
+**Reparameterization**
+
+The checkpoints provided are all `training-time` models. Use the reparameterization tool to switch them to the more efficient `inference-time` architecture, which has not only fewer parameters but also fewer computations.
+
+```bash
+python tools/convert_models/reparameterize_model.py ${CFG_PATH} ${SRC_CKPT_PATH} ${TARGET_CKPT_PATH}
+```
+
+`${CFG_PATH}` is the config file path, `${SRC_CKPT_PATH}` is the source checkpoint file, and `${TARGET_CKPT_PATH}` is the path of the target deploy weight file.
+
+To use the reparameterized weights, switch to the corresponding deploy config file.
+
+```bash
+python tools/test.py ${deploy_cfg} ${deploy_checkpoint}
+```
+
+You can also use `backbone.switch_to_deploy()` to switch to the deploy mode in Python code. For example:
+
+```python
+from mmpretrain.models import RepLKNet
+
+backbone = RepLKNet(arch='31B')
+backbone.switch_to_deploy()
+```
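+
+A quick way to convince yourself that the reparameterization is lossless is to compare the outputs before and after merging. This is a small sanity-check sketch; it assumes the classifier's default forward returns the classification logits as a tensor, and it must run in `eval()` mode so that the merged BN statistics match the forward pass.
+
+```python
+import torch
+from mmpretrain import get_model
+
+model = get_model('replknet-31B_3rdparty_in1k', pretrained=True).eval()
+x = torch.rand(1, 3, 224, 224)
+
+with torch.no_grad():
+    out_train = model(x)               # training-time (multi-branch) structure
+    model.backbone.switch_to_deploy()
+    out_deploy = model(x)              # reparameterized (merged) structure
+
+# The two outputs should agree up to floating point error.
+print((out_train - out_deploy).abs().max())
+```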
+
+
+
+## Models and results
+
+### Image Classification on ImageNet-1k
+
+| Model | Pretrain | Params (M) | Flops (G) | Top-1 (%) | Top-5 (%) | Config | Download |
+| :--------------------------------------------- | :----------: | :--------: | :-------: | :-------: | :-------: | :-----------------------------------------: | :------------------------------------------------------------: |
+| `replknet-31B_3rdparty_in1k`\* | From scratch | 79.86 | 15.64 | 83.48 | 96.57 | [config](replknet-31B_32xb64_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/replknet/replknet-31B_3rdparty_in1k_20221118-fd08e268.pth) |
+| `replknet-31B_3rdparty_in1k-384px`\* | From scratch | 79.86 | 45.95 | 84.84 | 97.34 | [config](replknet-31B_32xb64_in1k-384px.py) | [model](https://download.openmmlab.com/mmclassification/v0/replknet/replknet-31B_3rdparty_in1k-384px_20221118-03a170ce.pth) |
+| `replknet-31B_in21k-pre_3rdparty_in1k`\* | ImageNet-21k | 79.86 | 15.64 | 85.20 | 97.56 | [config](replknet-31B_32xb64_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/replknet/replknet-31B_in21k-pre_3rdparty_in1k_20221118-54ed5c46.pth) |
+| `replknet-31B_in21k-pre_3rdparty_in1k-384px`\* | ImageNet-21k | 79.86 | 45.95 | 85.99 | 97.75 | [config](replknet-31B_32xb64_in1k-384px.py) | [model](https://download.openmmlab.com/mmclassification/v0/replknet/replknet-31B_in21k-pre_3rdparty_in1k-384px_20221118-76c92b24.pth) |
+| `replknet-31L_in21k-pre_3rdparty_in1k-384px`\* | ImageNet-21k | 172.67 | 97.24 | 86.63 | 98.00 | [config](replknet-31L_32xb64_in1k-384px.py) | [model](https://download.openmmlab.com/mmclassification/v0/replknet/replknet-31L_in21k-pre_3rdparty_in1k-384px_20221118-dc3fc07c.pth) |
+| `replknet-XL_meg73m-pre_3rdparty_in1k-320px`\* | MEG73M | 335.44 | 129.57 | 87.57 | 98.39 | [config](replknet-XL_32xb64_in1k-320px.py) | [model](https://download.openmmlab.com/mmclassification/v0/replknet/replknet-XL_meg73m-pre_3rdparty_in1k-320px_20221118-88259b1d.pth) |
+
+*Models with * are converted from the [official repo](https://github.com/DingXiaoH/RepLKNet-pytorch/blob/main/replknet.py). The config files of these models are only for inference. We haven't reproduced the training results.*
+
+## Citation
+
+```bibtex
+@inproceedings{ding2022scaling,
+ title={Scaling up your kernels to 31x31: Revisiting large kernel design in cnns},
+ author={Ding, Xiaohan and Zhang, Xiangyu and Han, Jungong and Ding, Guiguang},
+ booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
+ pages={11963--11975},
+ year={2022}
+}
+```
diff --git a/configs/replknet/deploy/replknet-31B-deploy_32xb64_in1k-384px.py b/configs/replknet/deploy/replknet-31B-deploy_32xb64_in1k-384px.py
new file mode 100644
index 0000000000000000000000000000000000000000..a14fe63efafbff3f249a2e4d5b2c96de931c6c1f
--- /dev/null
+++ b/configs/replknet/deploy/replknet-31B-deploy_32xb64_in1k-384px.py
@@ -0,0 +1,3 @@
+_base_ = '../replknet-31B_32xb64_in1k-384px.py'
+
+model = dict(backbone=dict(small_kernel_merged=True))
diff --git a/configs/replknet/deploy/replknet-31B-deploy_32xb64_in1k.py b/configs/replknet/deploy/replknet-31B-deploy_32xb64_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..4f92c494f8afd0d494e199de20f26af7ce151aa1
--- /dev/null
+++ b/configs/replknet/deploy/replknet-31B-deploy_32xb64_in1k.py
@@ -0,0 +1,3 @@
+_base_ = '../replknet-31B_32xb64_in1k.py'
+
+model = dict(backbone=dict(small_kernel_merged=True))
diff --git a/configs/replknet/deploy/replknet-31L-deploy_32xb64_in1k-384px.py b/configs/replknet/deploy/replknet-31L-deploy_32xb64_in1k-384px.py
new file mode 100644
index 0000000000000000000000000000000000000000..63e590f9786173d879b1f4390c91392f1df45bec
--- /dev/null
+++ b/configs/replknet/deploy/replknet-31L-deploy_32xb64_in1k-384px.py
@@ -0,0 +1,3 @@
+_base_ = '../replknet-31L_32xb64_in1k-384px.py'
+
+model = dict(backbone=dict(small_kernel_merged=True))
diff --git a/configs/replknet/deploy/replknet-XL-deploy_32xb64_in1k-320px.py b/configs/replknet/deploy/replknet-XL-deploy_32xb64_in1k-320px.py
new file mode 100644
index 0000000000000000000000000000000000000000..a0a8ed5f8f30aea7e53811ae63767187d5494bc6
--- /dev/null
+++ b/configs/replknet/deploy/replknet-XL-deploy_32xb64_in1k-320px.py
@@ -0,0 +1,3 @@
+_base_ = '../replknet-XL_32xb64_in1k-320px.py'
+
+model = dict(backbone=dict(small_kernel_merged=True))
diff --git a/configs/replknet/metafile.yml b/configs/replknet/metafile.yml
new file mode 100644
index 0000000000000000000000000000000000000000..f9f37449778415e0de57394adb457c8bc57c9e2b
--- /dev/null
+++ b/configs/replknet/metafile.yml
@@ -0,0 +1,129 @@
+Collections:
+ - Name: RepLKNet
+ Metadata:
+ Training Data: ImageNet-1k
+ Architecture:
+ - Large-Kernel Convolution
+ - VGG-style Neural Network
+ Paper:
+ URL: https://arxiv.org/abs/2203.06717
+ Title: 'Scaling Up Your Kernels to 31x31: Revisiting Large Kernel Design in CNNs'
+ README: configs/replknet/README.md
+ Code:
+ URL: https://github.com/open-mmlab/mmpretrain/blob/v1.0.0rc3/mmcls/models/backbones/replknet.py
+ Version: v1.0.0rc3
+
+Models:
+ - Name: replknet-31B_3rdparty_in1k
+ In Collection: RepLKNet
+ Config: configs/replknet/replknet-31B_32xb64_in1k.py
+ Metadata:
+ FLOPs: 15636547584
+ Parameters: 79864168
+ Results:
+ - Dataset: ImageNet-1k
+ Task: Image Classification
+ Metrics:
+ Top 1 Accuracy: 83.48
+ Top 5 Accuracy: 96.57
+ Weights: https://download.openmmlab.com/mmclassification/v0/replknet/replknet-31B_3rdparty_in1k_20221118-fd08e268.pth
+ Converted From:
+ Weights: https://drive.google.com/u/0/uc?id=1azQUiCxK9feYVkkrPqwVPBtNsTzDrX7S&export=download
+ Code: https://github.com/DingXiaoH/RepLKNet-pytorch/blob/main/replknet.py
+
+ - Name: replknet-31B_3rdparty_in1k-384px
+ In Collection: RepLKNet
+ Config: configs/replknet/replknet-31B_32xb64_in1k-384px.py
+ Metadata:
+ FLOPs: 45952303104
+ Parameters: 79864168
+ Results:
+ - Dataset: ImageNet-1k
+ Task: Image Classification
+ Metrics:
+ Top 1 Accuracy: 84.84
+ Top 5 Accuracy: 97.34
+ Weights: https://download.openmmlab.com/mmclassification/v0/replknet/replknet-31B_3rdparty_in1k-384px_20221118-03a170ce.pth
+ Converted From:
+ Weights: https://drive.google.com/u/0/uc?id=1vo-P3XB6mRLUeDzmgv90dOu73uCeLfZN&export=download
+ Code: https://github.com/DingXiaoH/RepLKNet-pytorch/blob/main/replknet.py
+
+ - Name: replknet-31B_in21k-pre_3rdparty_in1k
+ In Collection: RepLKNet
+ Config: configs/replknet/replknet-31B_32xb64_in1k.py
+ Metadata:
+ Training Data:
+ - ImageNet-21k
+ - ImageNet-1k
+ FLOPs: 15636547584
+ Parameters: 79864168
+ Results:
+ - Dataset: ImageNet-1k
+ Task: Image Classification
+ Metrics:
+ Top 1 Accuracy: 85.20
+ Top 5 Accuracy: 97.56
+ Weights: https://download.openmmlab.com/mmclassification/v0/replknet/replknet-31B_in21k-pre_3rdparty_in1k_20221118-54ed5c46.pth
+ Converted From:
+ Weights: https://drive.google.com/u/0/uc?id=1DslZ2voXZQR1QoFY9KnbsHAeF84hzS0s&export=download
+ Code: https://github.com/DingXiaoH/RepLKNet-pytorch/blob/main/replknet.py
+
+ - Name: replknet-31B_in21k-pre_3rdparty_in1k-384px
+ In Collection: RepLKNet
+ Config: configs/replknet/replknet-31B_32xb64_in1k-384px.py
+ Metadata:
+ Training Data:
+ - ImageNet-21k
+ - ImageNet-1k
+ FLOPs: 45952303104
+ Parameters: 79864168
+ Results:
+ - Dataset: ImageNet-1k
+ Task: Image Classification
+ Metrics:
+ Top 1 Accuracy: 85.99
+ Top 5 Accuracy: 97.75
+ Weights: https://download.openmmlab.com/mmclassification/v0/replknet/replknet-31B_in21k-pre_3rdparty_in1k-384px_20221118-76c92b24.pth
+ Converted From:
+ Weights: https://drive.google.com/u/0/uc?id=1Sc46BWdXXm2fVP-K_hKKU_W8vAB-0duX&export=download
+ Code: https://github.com/DingXiaoH/RepLKNet-pytorch/blob/main/replknet.py
+
+ - Name: replknet-31L_in21k-pre_3rdparty_in1k-384px
+ In Collection: RepLKNet
+ Config: configs/replknet/replknet-31L_32xb64_in1k-384px.py
+ Metadata:
+ Training Data:
+ - ImageNet-21k
+ - ImageNet-1k
+ FLOPs: 97240006656
+ Parameters: 172671016
+ Results:
+ - Dataset: ImageNet-1k
+ Task: Image Classification
+ Metrics:
+ Top 1 Accuracy: 86.63
+ Top 5 Accuracy: 98.00
+ Weights: https://download.openmmlab.com/mmclassification/v0/replknet/replknet-31L_in21k-pre_3rdparty_in1k-384px_20221118-dc3fc07c.pth
+ Converted From:
+ Weights: https://drive.google.com/u/0/uc?id=1JYXoNHuRvC33QV1pmpzMTKEni1hpWfBl&export=download
+ Code: https://github.com/DingXiaoH/RepLKNet-pytorch/blob/main/replknet.py
+
+ - Name: replknet-XL_meg73m-pre_3rdparty_in1k-320px
+ In Collection: RepLKNet
+ Config: configs/replknet/replknet-XL_32xb64_in1k-320px.py
+ Metadata:
+ Training Data:
+ - MegData-73M
+ - ImageNet-1k
+ FLOPs: 129570201600
+ Parameters: 335435752
+ Results:
+ - Dataset: ImageNet-1k
+ Task: Image Classification
+ Metrics:
+ Top 1 Accuracy: 87.57
+ Top 5 Accuracy: 98.39
+ Weights: https://download.openmmlab.com/mmclassification/v0/replknet/replknet-XL_meg73m-pre_3rdparty_in1k-320px_20221118-88259b1d.pth
+ Converted From:
+ Weights: https://drive.google.com/u/0/uc?id=1tPC60El34GntXByIRHb-z-Apm4Y5LX1T&export=download
+ Code: https://github.com/DingXiaoH/RepLKNet-pytorch/blob/main/replknet.py
diff --git a/configs/replknet/replknet-31B_32xb64_in1k-384px.py b/configs/replknet/replknet-31B_32xb64_in1k-384px.py
new file mode 100644
index 0000000000000000000000000000000000000000..4e714f347a40101f2baf41a0723181a8502af85a
--- /dev/null
+++ b/configs/replknet/replknet-31B_32xb64_in1k-384px.py
@@ -0,0 +1,12 @@
+_base_ = [
+ '../_base_/models/replknet-31B_in1k.py',
+ '../_base_/datasets/imagenet_bs16_pil_bicubic_384.py',
+ '../_base_/schedules/imagenet_bs256_coslr.py',
+ '../_base_/default_runtime.py'
+]
+
+# schedule settings
+param_scheduler = dict(
+ type='CosineAnnealingLR', T_max=300, by_epoch=True, begin=0, end=300)
+
+train_cfg = dict(by_epoch=True, max_epochs=300)
diff --git a/configs/replknet/replknet-31B_32xb64_in1k.py b/configs/replknet/replknet-31B_32xb64_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..cf06f2d86a39450574747d670f4bb9a7dfffaca6
--- /dev/null
+++ b/configs/replknet/replknet-31B_32xb64_in1k.py
@@ -0,0 +1,12 @@
+_base_ = [
+ '../_base_/models/replknet-31B_in1k.py',
+ '../_base_/datasets/imagenet_bs32_pil_bicubic.py',
+ '../_base_/schedules/imagenet_bs256_coslr.py',
+ '../_base_/default_runtime.py'
+]
+
+# schedule settings
+param_scheduler = dict(
+ type='CosineAnnealingLR', T_max=300, by_epoch=True, begin=0, end=300)
+
+train_cfg = dict(by_epoch=True, max_epochs=300)
diff --git a/configs/replknet/replknet-31L_32xb64_in1k-384px.py b/configs/replknet/replknet-31L_32xb64_in1k-384px.py
new file mode 100644
index 0000000000000000000000000000000000000000..8cdab249fefba7b7878211479b682768538c4b27
--- /dev/null
+++ b/configs/replknet/replknet-31L_32xb64_in1k-384px.py
@@ -0,0 +1,12 @@
+_base_ = [
+ '../_base_/models/replknet-31L_in1k.py',
+ '../_base_/datasets/imagenet_bs16_pil_bicubic_384.py',
+ '../_base_/schedules/imagenet_bs256_coslr.py',
+ '../_base_/default_runtime.py'
+]
+
+# schedule settings
+param_scheduler = dict(
+ type='CosineAnnealingLR', T_max=300, by_epoch=True, begin=0, end=300)
+
+train_cfg = dict(by_epoch=True, max_epochs=300)
diff --git a/configs/replknet/replknet-XL_32xb64_in1k-320px.py b/configs/replknet/replknet-XL_32xb64_in1k-320px.py
new file mode 100644
index 0000000000000000000000000000000000000000..9b0aab114e725e822dbffb99a637cc9e770a91e7
--- /dev/null
+++ b/configs/replknet/replknet-XL_32xb64_in1k-320px.py
@@ -0,0 +1,12 @@
+_base_ = [
+ '../_base_/models/replknet-XL_in1k.py',
+ '../_base_/datasets/imagenet_bs8_pil_bicubic_320.py',
+ '../_base_/schedules/imagenet_bs256_coslr.py',
+ '../_base_/default_runtime.py'
+]
+
+# schedule settings
+param_scheduler = dict(
+ type='CosineAnnealingLR', T_max=300, by_epoch=True, begin=0, end=300)
+
+train_cfg = dict(by_epoch=True, max_epochs=300)
diff --git a/configs/repmlp/README.md b/configs/repmlp/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..41dfa234bd09153695a09af39b3901e536ca19b6
--- /dev/null
+++ b/configs/repmlp/README.md
@@ -0,0 +1,103 @@
+# RepMLP
+
+> [RepMLP: Re-parameterizing Convolutions into Fully-connected Layers for Image Recognition](https://arxiv.org/abs/2105.01883)
+
+
+
+## Abstract
+
+We propose RepMLP, a multi-layer-perceptron-style neural network building block for image recognition, which is composed of a series of fully-connected (FC) layers. Compared to convolutional layers, FC layers are more efficient, better at modeling the long-range dependencies and positional patterns, but worse at capturing the local structures, hence usually less favored for image recognition. We propose a structural re-parameterization technique that adds local prior into an FC to make it powerful for image recognition. Specifically, we construct convolutional layers inside a RepMLP during training and merge them into the FC for inference. On CIFAR, a simple pure-MLP model shows performance very close to CNN. By inserting RepMLP in traditional CNN, we improve ResNets by 1.8% accuracy on ImageNet, 2.9% for face recognition, and 2.3% mIoU on Cityscapes with lower FLOPs. Our intriguing findings highlight that combining the global representational capacity and positional perception of FC with the local prior of convolution can improve the performance of neural network with faster speed on both the tasks with translation invariance (e.g., semantic segmentation) and those with aligned images and positional patterns (e.g., face recognition).
+
+
+

+
+
+## How to use it?
+
+
+
+**Predict image**
+
+```python
+from mmpretrain import inference_model, get_model
+
+model = get_model('repmlp-base_3rdparty_8xb64_in1k', pretrained=True)
+model.backbone.switch_to_deploy()
+predict = inference_model(model, 'demo/bird.JPEG')
+print(predict['pred_class'])
+print(predict['pred_score'])
+```
+
+**Use the model**
+
+```python
+import torch
+from mmpretrain import get_model
+
+model = get_model('repmlp-base_3rdparty_8xb64_in1k', pretrained=True)
+inputs = torch.rand(1, 3, 224, 224)
+out = model(inputs)
+print(type(out))
+# To extract features.
+feats = model.extract_feat(inputs)
+print(type(feats))
+```
+
+**Test Command**
+
+Prepare your dataset according to the [docs](https://mmpretrain.readthedocs.io/en/latest/user_guides/dataset_prepare.html#prepare-dataset).
+
+Test:
+
+```shell
+python tools/test.py configs/repmlp/repmlp-base_8xb64_in1k.py https://download.openmmlab.com/mmclassification/v0/repmlp/repmlp-base_3rdparty_8xb64_in1k_20220330-1cb1f11b.pth
+```
+
+**Reparameterization**
+
+The checkpoints provided are all `training-time` models. Use the reparameterization tool to convert them to the more efficient `inference-time` architecture, which has both fewer parameters and lower computational cost.
+
+```bash
+python tools/convert_models/reparameterize_model.py ${CFG_PATH} ${SRC_CKPT_PATH} ${TARGET_CKPT_PATH}
+```
+
+`${CFG_PATH}` is the path of the config file, `${SRC_CKPT_PATH}` is the path of the source checkpoint file, and `${TARGET_CKPT_PATH}` is the target path of the deploy weight file.
+
+To test the reparameterized weights, you must use the corresponding deploy config file.
+
+```bash
+python tools/test.py ${deploy_cfg} ${deploy_checkpoint} --metrics accuracy
+```
+
+You can also use `backbone.switch_to_deploy()` to switch to the deploy mode in Python code. For example:
+
+```python
+from mmpretrain.models import RepMLPNet
+
+backbone = RepMLPNet(arch='B', img_size=224, reparam_conv_kernels=(1, 3))
+backbone.switch_to_deploy()
+```
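+
+As a rough sanity check (a sketch, not part of the repo), you can verify that the merged model is numerically equivalent to the training-time model:
+
+```python
+import torch
+
+from mmpretrain.models import RepMLPNet
+
+backbone = RepMLPNet(arch='B', img_size=224, reparam_conv_kernels=(1, 3))
+backbone.eval()  # use running BN statistics so the comparison is deterministic
+inputs = torch.rand(1, 3, 224, 224)
+with torch.no_grad():
+    feats_before = backbone(inputs)
+    backbone.switch_to_deploy()
+    feats_after = backbone(inputs)
+# The re-parameterized model should produce (almost) identical outputs.
+print(torch.allclose(feats_before[0], feats_after[0], atol=1e-5))
+```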
+
+
+
+## Models and results
+
+### Image Classification on ImageNet-1k
+
+| Model | Pretrain | Params (M) | Flops (G) | Top-1 (%) | Top-5 (%) | Config | Download |
+| :---------------------------------------- | :----------: | :--------: | :-------: | :-------: | :-------: | :---------------------------------------: | :-------------------------------------------------------------------: |
+| `repmlp-base_3rdparty_8xb64_in1k`\* | From scratch | 68.24 | 6.71 | 80.41 | 95.14 | [config](repmlp-base_8xb64_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/repmlp/repmlp-base_3rdparty_8xb64_in1k_20220330-1cb1f11b.pth) |
+| `repmlp-base_3rdparty_8xb64_in1k-256px`\* | From scratch | 96.45 | 9.69 | 81.11 | 95.50 | [config](repmlp-base_8xb64_in1k-256px.py) | [model](https://download.openmmlab.com/mmclassification/v0/repmlp/repmlp-base_3rdparty_8xb64_in1k-256px_20220330-7c5a91ce.pth) |
+
+*Models with * are converted from the [official repo](https://github.com/DingXiaoH/RepMLP/blob/072d8516beba83d75dfe6ebb12f625abad4b53d5/repmlpnet.py#L278). The config files of these models are only for inference. We haven't reproduced the training results.*
+
+## Citation
+
+```bibtex
+@article{ding2021repmlp,
+ title={Repmlp: Re-parameterizing convolutions into fully-connected layers for image recognition},
+ author={Ding, Xiaohan and Xia, Chunlong and Zhang, Xiangyu and Chu, Xiaojie and Han, Jungong and Ding, Guiguang},
+ journal={arXiv preprint arXiv:2105.01883},
+ year={2021}
+}
+```
diff --git a/configs/repmlp/metafile.yml b/configs/repmlp/metafile.yml
new file mode 100644
index 0000000000000000000000000000000000000000..7f391e04b7cfc2b3ffc93dbd2a781e6b201d1cde
--- /dev/null
+++ b/configs/repmlp/metafile.yml
@@ -0,0 +1,48 @@
+Collections:
+ - Name: RepMLP
+ Metadata:
+ Training Data: ImageNet-1k
+ Architecture:
+ - Multi-layer Perceptron
+ - Re-parameterization Convolution
+ Paper:
+ URL: https://arxiv.org/abs/2105.01883
+ Title: 'RepMLP: Re-parameterizing Convolutions into Fully-connected Layers for Image Recognition'
+ README: configs/repmlp/README.md
+ Code:
+ URL: https://github.com/open-mmlab/mmpretrain/blob/v0.21.0/mmcls/models/backbones/repmlp.py
+ Version: v0.21.0
+
+Models:
+ - Name: repmlp-base_3rdparty_8xb64_in1k
+ In Collection: RepMLP
+ Config: configs/repmlp/repmlp-base_8xb64_in1k.py
+ Metadata:
+ FLOPs: 6710000000 # 6.71 G
+ Parameters: 68240000 # 68.24 M
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 80.41
+ Top 5 Accuracy: 95.14
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/repmlp/repmlp-base_3rdparty_8xb64_in1k_20220330-1cb1f11b.pth
+ Converted From:
+ Weights: https://github.com/DingXiaoH/RepMLP
+ Code: https://github.com/DingXiaoH/RepMLP/blob/072d8516beba83d75dfe6ebb12f625abad4b53d5/repmlpnet.py#L274
+ - Name: repmlp-base_3rdparty_8xb64_in1k-256px
+ In Collection: RepMLP
+ Config: configs/repmlp/repmlp-base_8xb64_in1k-256px.py
+ Metadata:
+ FLOPs: 9690000000 # 9.69 G
+ Parameters: 96450000 # 96.45M
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 81.11
+ Top 5 Accuracy: 95.50
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/repmlp/repmlp-base_3rdparty_8xb64_in1k-256px_20220330-7c5a91ce.pth
+ Converted From:
+ Weights: https://github.com/DingXiaoH/RepMLP
+ Code: https://github.com/DingXiaoH/RepMLP/blob/072d8516beba83d75dfe6ebb12f625abad4b53d5/repmlpnet.py#L278
diff --git a/configs/repmlp/repmlp-base_8xb64_in1k-256px.py b/configs/repmlp/repmlp-base_8xb64_in1k-256px.py
new file mode 100644
index 0000000000000000000000000000000000000000..81dc55a204918dec83b31c80cd37125a4ce3bb27
--- /dev/null
+++ b/configs/repmlp/repmlp-base_8xb64_in1k-256px.py
@@ -0,0 +1,36 @@
+_base_ = [
+ '../_base_/models/repmlp-base_224.py',
+ '../_base_/datasets/imagenet_bs64_pil_resize.py',
+ '../_base_/schedules/imagenet_bs4096_AdamW.py',
+ '../_base_/default_runtime.py'
+]
+
+# model settings
+model = dict(backbone=dict(img_size=256))
+
+# dataset settings
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='RandomResizedCrop', scale=256),
+ dict(type='RandomFlip', prob=0.5, direction='horizontal'),
+ dict(type='PackInputs'),
+]
+
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='ResizeEdge', scale=292, edge='short', backend='pillow'),
+ dict(type='CenterCrop', crop_size=256),
+ dict(type='PackInputs'),
+]
+
+train_dataloader = dict(dataset=dict(pipeline=train_pipeline))
+val_dataloader = dict(dataset=dict(pipeline=test_pipeline))
+test_dataloader = dict(dataset=dict(pipeline=test_pipeline))
+
+# schedule settings
+optim_wrapper = dict(clip_grad=dict(max_norm=1.0))
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR
+# based on the actual training batch size.
+# base_batch_size = (8 GPUs) x (64 samples per GPU)
+auto_scale_lr = dict(base_batch_size=512)
diff --git a/configs/repmlp/repmlp-base_8xb64_in1k.py b/configs/repmlp/repmlp-base_8xb64_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..666ce405440d7c764a0959900cc3650f329cc019
--- /dev/null
+++ b/configs/repmlp/repmlp-base_8xb64_in1k.py
@@ -0,0 +1,26 @@
+_base_ = [
+ '../_base_/models/repmlp-base_224.py',
+ '../_base_/datasets/imagenet_bs64_pil_resize.py',
+ '../_base_/schedules/imagenet_bs1024_adamw_swin.py',
+ '../_base_/default_runtime.py'
+]
+
+# dataset settings
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+    # Resize to (256, 256) here, which differs from resizing the shorter edge to 256.
+ dict(type='Resize', scale=(256, 256), backend='pillow'),
+ dict(type='CenterCrop', crop_size=224),
+ dict(type='PackInputs'),
+]
+
+val_dataloader = dict(dataset=dict(pipeline=test_pipeline))
+test_dataloader = dict(dataset=dict(pipeline=test_pipeline))
+
+# schedule settings
+optim_wrapper = dict(clip_grad=dict(max_norm=5.0))
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR
+# based on the actual training batch size.
+# base_batch_size = (8 GPUs) x (64 samples per GPU)
+auto_scale_lr = dict(base_batch_size=512)
diff --git a/configs/repmlp/repmlp-base_delopy_8xb64_in1k.py b/configs/repmlp/repmlp-base_delopy_8xb64_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..b5b2c882341421f225b0b3ca0b57e2efd6c06e07
--- /dev/null
+++ b/configs/repmlp/repmlp-base_delopy_8xb64_in1k.py
@@ -0,0 +1,3 @@
+_base_ = ['./repmlp-base_8xb64_in1k.py']
+
+model = dict(backbone=dict(deploy=True))
diff --git a/configs/repmlp/repmlp-base_deploy_8xb64_in1k-256px.py b/configs/repmlp/repmlp-base_deploy_8xb64_in1k-256px.py
new file mode 100644
index 0000000000000000000000000000000000000000..27ff50a02dc65c56162e7f851506f00dbb6bc8da
--- /dev/null
+++ b/configs/repmlp/repmlp-base_deploy_8xb64_in1k-256px.py
@@ -0,0 +1,3 @@
+_base_ = ['./repmlp-base_8xb64_in1k-256px.py']
+
+model = dict(backbone=dict(deploy=True))
diff --git a/configs/repvgg/README.md b/configs/repvgg/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..9a47f9d1e0a56a027072b661aef54225f1423205
--- /dev/null
+++ b/configs/repvgg/README.md
@@ -0,0 +1,142 @@
+# RepVGG
+
+> [RepVGG: Making VGG-style ConvNets Great Again](https://arxiv.org/abs/2101.03697)
+
+
+
+## Introduction
+
+RepVGG is a VGG-style convolutional architecture. It has the following advantages:
+
+1. The model has a VGG-like plain (a.k.a. feed-forward) topology without any branches, i.e., every layer takes the output of its only preceding layer as input and feeds its output into its only following layer.
+2. The model’s body uses only 3 × 3 convolutions and ReLU.
+3. The concrete architecture (including the specific depths and layer widths) is instantiated with no automatic search, manual refinement, compound scaling, or other heavy designs.
+
+
+

+
+
+## Abstract
+
+
+
+
+
+We present a simple but powerful architecture of convolutional neural network, which has a VGG-like inference-time body composed of nothing but a stack of 3x3 convolution and ReLU, while the training-time model has a multi-branch topology. Such decoupling of the training-time and inference-time architecture is realized by a structural re-parameterization technique so that the model is named RepVGG. On ImageNet, RepVGG reaches over 80% top-1 accuracy, which is the first time for a plain model, to the best of our knowledge. On NVIDIA 1080Ti GPU, RepVGG models run 83% faster than ResNet-50 or 101% faster than ResNet-101 with higher accuracy and show favorable accuracy-speed trade-off compared to the state-of-the-art models like EfficientNet and RegNet.
+
+
+
+
+## How to use it?
+
+
+
+**Predict image**
+
+```python
+from mmpretrain import inference_model, get_model
+
+model = get_model('repvgg-A0_8xb32_in1k', pretrained=True)
+model.backbone.switch_to_deploy()
+predict = inference_model(model, 'demo/bird.JPEG')
+print(predict['pred_class'])
+print(predict['pred_score'])
+```
+
+**Use the model**
+
+```python
+import torch
+from mmpretrain import get_model
+
+model = get_model('repvgg-A0_8xb32_in1k', pretrained=True)
+inputs = torch.rand(1, 3, 224, 224)
+out = model(inputs)
+print(type(out))
+# To extract features.
+feats = model.extract_feat(inputs)
+print(type(feats))
+```
+
+**Train/Test Command**
+
+Prepare your dataset according to the [docs](https://mmpretrain.readthedocs.io/en/latest/user_guides/dataset_prepare.html#prepare-dataset).
+
+Train:
+
+```shell
+python tools/train.py configs/repvgg/repvgg-A0_8xb32_in1k.py
+```
+
+Test:
+
+```shell
+python tools/test.py configs/repvgg/repvgg-A0_8xb32_in1k.py https://download.openmmlab.com/mmclassification/v0/repvgg/repvgg-A0_8xb32_in1k_20221213-60ae8e23.pth
+```
+
+Test with reparameterized model:
+
+```shell
+python tools/test.py configs/repvgg/repvgg-A0_8xb32_in1k.py repvgg_A0_deploy.pth --cfg-options model.backbone.deploy=True
+```
+
+**Reparameterization**
+
+The checkpoints provided are all `training-time` models. Use the reparameterization tool to convert them to the more efficient `inference-time` architecture, which has both fewer parameters and lower computational cost.
+
+```bash
+python tools/convert_models/reparameterize_model.py ${CFG_PATH} ${SRC_CKPT_PATH} ${TARGET_CKPT_PATH}
+```
+
+`${CFG_PATH}` is the path of the config file, `${SRC_CKPT_PATH}` is the path of the source checkpoint file, and `${TARGET_CKPT_PATH}` is the target path of the deploy weight file.
+
+To test the reparameterized weights, you must use the corresponding deploy config file.
+
+```bash
+python tools/test.py ${deploy_cfg} ${deploy_checkpoint} --metrics accuracy
+```
+
+You can also use `backbone.switch_to_deploy()` to switch to the deploy mode in Python code. For example:
+
+```python
+from mmpretrain.models import RepVGG
+
+backbone = RepVGG(arch='A0')
+backbone.switch_to_deploy()
+```
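+
+As an illustration of the effect (a sketch, not part of the repo), you can compare the parameter counts before and after merging the branches:
+
+```python
+from mmpretrain.models import RepVGG
+
+backbone = RepVGG(arch='A0')
+params_train = sum(p.numel() for p in backbone.parameters())
+backbone.switch_to_deploy()
+params_deploy = sum(p.numel() for p in backbone.parameters())
+# The deploy-mode model is smaller because the 3x3, 1x1 and identity branches
+# of each block are fused into a single 3x3 convolution.
+print(f'training-time: {params_train}, inference-time: {params_deploy}')
+```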
+
+
+
+## Models and results
+
+### Image Classification on ImageNet-1k
+
+| Model | Pretrain | Params (M) | Flops (G) | Top-1 (%) | Top-5 (%) | Config | Download |
+| :---------------------------- | :----------: | :--------: | :-------: | :-------: | :-------: | :---------------------------------: | :-------------------------------------------------------------------------------------: |
+| `repvgg-A0_8xb32_in1k` | From scratch | 8.31 | 1.36 | 72.37 | 90.56 | [config](repvgg-A0_8xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/repvgg/repvgg-A0_8xb32_in1k_20221213-60ae8e23.pth) \| [log](https://download.openmmlab.com/mmclassification/v0/repvgg/repvgg-A0_8xb32_in1k_20221213-60ae8e23.log) |
+| `repvgg-A1_8xb32_in1k` | From scratch | 12.79 | 2.36 | 74.23 | 91.80 | [config](repvgg-A1_8xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/repvgg/repvgg-A1_8xb32_in1k_20221213-f81bf3df.pth) \| [log](https://download.openmmlab.com/mmclassification/v0/repvgg/repvgg-A1_8xb32_in1k_20221213-f81bf3df.log) |
+| `repvgg-A2_8xb32_in1k` | From scratch | 25.50 | 5.12 | 76.49 | 93.09 | [config](repvgg-A2_8xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/repvgg/repvgg-A2_8xb32_in1k_20221213-a8767caf.pth) \| [log](https://download.openmmlab.com/mmclassification/v0/repvgg/repvgg-A2_8xb32_in1k_20221213-a8767caf.log) |
+| `repvgg-B0_8xb32_in1k` | From scratch | 15.82 | 3.42 | 75.27 | 92.21 | [config](repvgg-B0_8xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/repvgg/repvgg-B0_8xb32_in1k_20221213-5091ecc7.pth) \| [log](https://download.openmmlab.com/mmclassification/v0/repvgg/repvgg-B0_8xb32_in1k_20221213-5091ecc7.log) |
+| `repvgg-B1_8xb32_in1k` | From scratch | 51.83 | 11.81 | 78.19 | 94.04 | [config](repvgg-B1_8xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/repvgg/repvgg-B1_8xb32_in1k_20221213-d17c45e7.pth) \| [log](https://download.openmmlab.com/mmclassification/v0/repvgg/repvgg-B1_8xb32_in1k_20221213-d17c45e7.log) |
+| `repvgg-B1g2_8xb32_in1k` | From scratch | 41.36 | 8.81 | 77.87 | 93.99 | [config](repvgg-B1g2_8xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/repvgg/repvgg-B1g2_8xb32_in1k_20221213-ae6428fd.pth) \| [log](https://download.openmmlab.com/mmclassification/v0/repvgg/repvgg-B1g2_8xb32_in1k_20221213-ae6428fd.log) |
+| `repvgg-B1g4_8xb32_in1k` | From scratch | 36.13 | 7.30 | 77.81 | 93.77 | [config](repvgg-B1g4_8xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/repvgg/repvgg-B1g4_8xb32_in1k_20221213-a7a4aaea.pth) \| [log](https://download.openmmlab.com/mmclassification/v0/repvgg/repvgg-B1g4_8xb32_in1k_20221213-a7a4aaea.log) |
+| `repvgg-B2_8xb32_in1k` | From scratch | 80.32 | 18.37 | 78.58 | 94.23 | [config](repvgg-B2_8xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/repvgg/repvgg-B2_8xb32_in1k_20221213-d8b420ef.pth) \| [log](https://download.openmmlab.com/mmclassification/v0/repvgg/repvgg-B2_8xb32_in1k_20221213-d8b420ef.log) |
+| `repvgg-B2g4_8xb32_in1k` | From scratch | 55.78 | 11.33 | 79.44 | 94.72 | [config](repvgg-B2g4_8xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/repvgg/repvgg-B2g4_8xb32_in1k_20221213-0c1990eb.pth) \| [log](https://download.openmmlab.com/mmclassification/v0/repvgg/repvgg-B2g4_8xb32_in1k_20221213-0c1990eb.log) |
+| `repvgg-B3_8xb32_in1k` | From scratch | 110.96 | 26.21 | 80.58 | 95.33 | [config](repvgg-B3_8xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/repvgg/repvgg-B3_8xb32_in1k_20221213-927a329a.pth) \| [log](https://download.openmmlab.com/mmclassification/v0/repvgg/repvgg-B3_8xb32_in1k_20221213-927a329a.log) |
+| `repvgg-B3g4_8xb32_in1k` | From scratch | 75.63 | 16.06 | 80.26 | 95.15 | [config](repvgg-B3g4_8xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/repvgg/repvgg-B3g4_8xb32_in1k_20221213-e01cb280.pth) \| [log](https://download.openmmlab.com/mmclassification/v0/repvgg/repvgg-B3g4_8xb32_in1k_20221213-e01cb280.log) |
+| `repvgg-D2se_3rdparty_in1k`\* | From scratch | 120.39 | 32.84 | 81.81 | 95.94 | [config](repvgg-D2se_8xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/repvgg/repvgg-D2se_3rdparty_4xb64-autoaug-lbs-mixup-coslr-200e_in1k_20210909-cf3139b7.pth) |
+
+*Models with * are converted from the [official repo](https://github.com/DingXiaoH/RepVGG/blob/9f272318abfc47a2b702cd0e916fca8d25d683e7/repvgg.py#L250). The config files of these models are only for inference. We haven't reproduced the training results.*
+
+## Citation
+
+```bibtex
+@inproceedings{ding2021repvgg,
+ title={Repvgg: Making vgg-style convnets great again},
+ author={Ding, Xiaohan and Zhang, Xiangyu and Ma, Ningning and Han, Jungong and Ding, Guiguang and Sun, Jian},
+ booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
+ pages={13733--13742},
+ year={2021}
+}
+```
diff --git a/configs/repvgg/metafile.yml b/configs/repvgg/metafile.yml
new file mode 100644
index 0000000000000000000000000000000000000000..e93250ae2288b2ace58081bdcc24fc80c2f3c5b5
--- /dev/null
+++ b/configs/repvgg/metafile.yml
@@ -0,0 +1,175 @@
+Collections:
+ - Name: RepVGG
+ Metadata:
+ Training Data: ImageNet-1k
+ Architecture:
+ - re-parameterization Convolution
+ - VGG-style Neural Network
+ Paper:
+ URL: https://arxiv.org/abs/2101.03697
+ Title: 'RepVGG: Making VGG-style ConvNets Great Again'
+ README: configs/repvgg/README.md
+ Code:
+ URL: https://github.com/open-mmlab/mmpretrain/blob/v0.16.0/mmcls/models/backbones/repvgg.py#L257
+ Version: v0.16.0
+
+Models:
+ - Name: repvgg-A0_8xb32_in1k
+ In Collection: RepVGG
+ Config: configs/repvgg/repvgg-A0_8xb32_in1k.py
+ Metadata:
+ FLOPs: 1360233728
+ Parameters: 8309384
+ Results:
+ - Dataset: ImageNet-1k
+ Task: Image Classification
+ Metrics:
+ Top 1 Accuracy: 72.37
+ Top 5 Accuracy: 90.56
+ Weights: https://download.openmmlab.com/mmclassification/v0/repvgg/repvgg-A0_8xb32_in1k_20221213-60ae8e23.pth
+ - Name: repvgg-A1_8xb32_in1k
+ In Collection: RepVGG
+ Config: configs/repvgg/repvgg-A1_8xb32_in1k.py
+ Metadata:
+ FLOPs: 2362750208
+ Parameters: 12789864
+ Results:
+ - Dataset: ImageNet-1k
+ Task: Image Classification
+ Metrics:
+ Top 1 Accuracy: 74.23
+ Top 5 Accuracy: 91.80
+ Weights: https://download.openmmlab.com/mmclassification/v0/repvgg/repvgg-A1_8xb32_in1k_20221213-f81bf3df.pth
+ - Name: repvgg-A2_8xb32_in1k
+ In Collection: RepVGG
+ Config: configs/repvgg/repvgg-A2_8xb32_in1k.py
+ Metadata:
+ FLOPs: 5115612544
+ Parameters: 25499944
+ Results:
+ - Dataset: ImageNet-1k
+ Task: Image Classification
+ Metrics:
+ Top 1 Accuracy: 76.49
+ Top 5 Accuracy: 93.09
+ Weights: https://download.openmmlab.com/mmclassification/v0/repvgg/repvgg-A2_8xb32_in1k_20221213-a8767caf.pth
+ - Name: repvgg-B0_8xb32_in1k
+ In Collection: RepVGG
+ Config: configs/repvgg/repvgg-B0_8xb32_in1k.py
+ Metadata:
+      FLOPs: 3420000000
+      Parameters: 15820000
+ Results:
+ - Dataset: ImageNet-1k
+ Task: Image Classification
+ Metrics:
+ Top 1 Accuracy: 75.27
+ Top 5 Accuracy: 92.21
+ Weights: https://download.openmmlab.com/mmclassification/v0/repvgg/repvgg-B0_8xb32_in1k_20221213-5091ecc7.pth
+ - Name: repvgg-B1_8xb32_in1k
+ In Collection: RepVGG
+ Config: configs/repvgg/repvgg-B1_8xb32_in1k.py
+ Metadata:
+ FLOPs: 11813537792
+ Parameters: 51829480
+ Results:
+ - Dataset: ImageNet-1k
+ Task: Image Classification
+ Metrics:
+ Top 1 Accuracy: 78.19
+ Top 5 Accuracy: 94.04
+ Weights: https://download.openmmlab.com/mmclassification/v0/repvgg/repvgg-B1_8xb32_in1k_20221213-d17c45e7.pth
+ - Name: repvgg-B1g2_8xb32_in1k
+ In Collection: RepVGG
+ Config: configs/repvgg/repvgg-B1g2_8xb32_in1k.py
+ Metadata:
+ FLOPs: 8807794688
+ Parameters: 41360104
+ Results:
+ - Dataset: ImageNet-1k
+ Task: Image Classification
+ Metrics:
+ Top 1 Accuracy: 77.87
+ Top 5 Accuracy: 93.99
+ Weights: https://download.openmmlab.com/mmclassification/v0/repvgg/repvgg-B1g2_8xb32_in1k_20221213-ae6428fd.pth
+ - Name: repvgg-B1g4_8xb32_in1k
+ In Collection: RepVGG
+ Config: configs/repvgg/repvgg-B1g4_8xb32_in1k.py
+ Metadata:
+ FLOPs: 7304923136
+ Parameters: 36125416
+ Results:
+ - Dataset: ImageNet-1k
+ Task: Image Classification
+ Metrics:
+ Top 1 Accuracy: 77.81
+ Top 5 Accuracy: 93.77
+ Weights: https://download.openmmlab.com/mmclassification/v0/repvgg/repvgg-B1g4_8xb32_in1k_20221213-a7a4aaea.pth
+ - Name: repvgg-B2_8xb32_in1k
+ In Collection: RepVGG
+ Config: configs/repvgg/repvgg-B2_8xb32_in1k.py
+ Metadata:
+ FLOPs: 18374175232
+ Parameters: 80315112
+ Results:
+ - Dataset: ImageNet-1k
+ Task: Image Classification
+ Metrics:
+ Top 1 Accuracy: 78.58
+ Top 5 Accuracy: 94.23
+ Weights: https://download.openmmlab.com/mmclassification/v0/repvgg/repvgg-B2_8xb32_in1k_20221213-d8b420ef.pth
+ - Name: repvgg-B2g4_8xb32_in1k
+ In Collection: RepVGG
+ Config: configs/repvgg/repvgg-B2g4_8xb32_in1k.py
+ Metadata:
+ FLOPs: 11329464832
+ Parameters: 55777512
+ Results:
+ - Dataset: ImageNet-1k
+ Task: Image Classification
+ Metrics:
+ Top 1 Accuracy: 79.44
+ Top 5 Accuracy: 94.72
+ Weights: https://download.openmmlab.com/mmclassification/v0/repvgg/repvgg-B2g4_8xb32_in1k_20221213-0c1990eb.pth
+ - Name: repvgg-B3_8xb32_in1k
+ In Collection: RepVGG
+ Config: configs/repvgg/repvgg-B3_8xb32_in1k.py
+ Metadata:
+ FLOPs: 26206448128
+ Parameters: 110960872
+ Results:
+ - Dataset: ImageNet-1k
+ Task: Image Classification
+ Metrics:
+ Top 1 Accuracy: 80.58
+ Top 5 Accuracy: 95.33
+ Weights: https://download.openmmlab.com/mmclassification/v0/repvgg/repvgg-B3_8xb32_in1k_20221213-927a329a.pth
+ - Name: repvgg-B3g4_8xb32_in1k
+ In Collection: RepVGG
+ Config: configs/repvgg/repvgg-B3g4_8xb32_in1k.py
+ Metadata:
+ FLOPs: 16062065152
+ Parameters: 75626728
+ Results:
+ - Dataset: ImageNet-1k
+ Task: Image Classification
+ Metrics:
+ Top 1 Accuracy: 80.26
+ Top 5 Accuracy: 95.15
+ Weights: https://download.openmmlab.com/mmclassification/v0/repvgg/repvgg-B3g4_8xb32_in1k_20221213-e01cb280.pth
+ - Name: repvgg-D2se_3rdparty_in1k
+ In Collection: RepVGG
+ Config: configs/repvgg/repvgg-D2se_8xb32_in1k.py
+ Metadata:
+ FLOPs: 32838581760
+ Parameters: 120387572
+ Results:
+ - Dataset: ImageNet-1k
+ Task: Image Classification
+ Metrics:
+ Top 1 Accuracy: 81.81
+ Top 5 Accuracy: 95.94
+ Weights: https://download.openmmlab.com/mmclassification/v0/repvgg/repvgg-D2se_3rdparty_4xb64-autoaug-lbs-mixup-coslr-200e_in1k_20210909-cf3139b7.pth
+ Converted From:
+ Weights: https://drive.google.com/drive/folders/1Avome4KvNp0Lqh2QwhXO6L5URQjzCjUq
+ Code: https://github.com/DingXiaoH/RepVGG/blob/9f272318abfc47a2b702cd0e916fca8d25d683e7/repvgg.py#L250
diff --git a/configs/repvgg/repvgg-A0_8xb32_in1k.py b/configs/repvgg/repvgg-A0_8xb32_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..b767ae2a3e4062563cec782385baafdf6181baf3
--- /dev/null
+++ b/configs/repvgg/repvgg-A0_8xb32_in1k.py
@@ -0,0 +1,33 @@
+_base_ = [
+ '../_base_/models/repvgg-A0_in1k.py',
+ '../_base_/datasets/imagenet_bs32_pil_resize.py',
+ '../_base_/schedules/imagenet_bs256_coslr.py',
+ '../_base_/default_runtime.py'
+]
+
+val_dataloader = dict(batch_size=256)
+test_dataloader = dict(batch_size=256)
+
+# schedule settings
+optim_wrapper = dict(
+ paramwise_cfg=dict(
+ bias_decay_mult=0.0,
+ custom_keys={
+ 'branch_3x3.norm': dict(decay_mult=0.0),
+ 'branch_1x1.norm': dict(decay_mult=0.0),
+ 'branch_norm.bias': dict(decay_mult=0.0),
+ }))
+
+# schedule settings
+param_scheduler = dict(
+ type='CosineAnnealingLR',
+ T_max=120,
+ by_epoch=True,
+ begin=0,
+ end=120,
+ convert_to_iter_based=True)
+
+train_cfg = dict(by_epoch=True, max_epochs=120)
+
+default_hooks = dict(
+ checkpoint=dict(type='CheckpointHook', interval=1, max_keep_ckpts=3))
diff --git a/configs/repvgg/repvgg-A0_deploy_in1k.py b/configs/repvgg/repvgg-A0_deploy_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..897e2bb36e9ad8197b4889f22530a32a79fef055
--- /dev/null
+++ b/configs/repvgg/repvgg-A0_deploy_in1k.py
@@ -0,0 +1,3 @@
+_base_ = './repvgg-A0_8xb32_in1k.py'
+
+model = dict(backbone=dict(deploy=True))
diff --git a/configs/repvgg/repvgg-A1_8xb32_in1k.py b/configs/repvgg/repvgg-A1_8xb32_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..fab5e586359370dd59a7ba55b91511541e922a11
--- /dev/null
+++ b/configs/repvgg/repvgg-A1_8xb32_in1k.py
@@ -0,0 +1,3 @@
+_base_ = './repvgg-A0_8xb32_in1k.py'
+
+model = dict(backbone=dict(arch='A1'))
diff --git a/configs/repvgg/repvgg-A2_8xb32_in1k.py b/configs/repvgg/repvgg-A2_8xb32_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..f6196f02fbfedb36e9e498160884eeb7315513f6
--- /dev/null
+++ b/configs/repvgg/repvgg-A2_8xb32_in1k.py
@@ -0,0 +1,3 @@
+_base_ = './repvgg-A0_8xb32_in1k.py'
+
+model = dict(backbone=dict(arch='A2'), head=dict(in_channels=1408))
diff --git a/configs/repvgg/repvgg-B0_8xb32_in1k.py b/configs/repvgg/repvgg-B0_8xb32_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..9bbc4ab2259ccd929eae948cae0f676b7fca4b74
--- /dev/null
+++ b/configs/repvgg/repvgg-B0_8xb32_in1k.py
@@ -0,0 +1,3 @@
+_base_ = './repvgg-A0_8xb32_in1k.py'
+
+model = dict(backbone=dict(arch='B0'), head=dict(in_channels=1280))
diff --git a/configs/repvgg/repvgg-B1_8xb32_in1k.py b/configs/repvgg/repvgg-B1_8xb32_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..e08db3c4b8145cd3141851a7b41bbbe4fbfff776
--- /dev/null
+++ b/configs/repvgg/repvgg-B1_8xb32_in1k.py
@@ -0,0 +1,3 @@
+_base_ = './repvgg-A0_8xb32_in1k.py'
+
+model = dict(backbone=dict(arch='B1'), head=dict(in_channels=2048))
diff --git a/configs/repvgg/repvgg-B1g2_8xb32_in1k.py b/configs/repvgg/repvgg-B1g2_8xb32_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..a1c53fded4e0ff0c59038fb82ca8cb0ca3e41742
--- /dev/null
+++ b/configs/repvgg/repvgg-B1g2_8xb32_in1k.py
@@ -0,0 +1,3 @@
+_base_ = './repvgg-A0_8xb32_in1k.py'
+
+model = dict(backbone=dict(arch='B1g2'), head=dict(in_channels=2048))
diff --git a/configs/repvgg/repvgg-B1g4_8xb32_in1k.py b/configs/repvgg/repvgg-B1g4_8xb32_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..0757b1e580e5091b9d5c633cd87c856a526ebdf0
--- /dev/null
+++ b/configs/repvgg/repvgg-B1g4_8xb32_in1k.py
@@ -0,0 +1,3 @@
+_base_ = './repvgg-A0_8xb32_in1k.py'
+
+model = dict(backbone=dict(arch='B1g4'), head=dict(in_channels=2048))
diff --git a/configs/repvgg/repvgg-B2_8xb32_in1k.py b/configs/repvgg/repvgg-B2_8xb32_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..b9a7d4ca5570518f0c4d0b81951e0e97c46606f9
--- /dev/null
+++ b/configs/repvgg/repvgg-B2_8xb32_in1k.py
@@ -0,0 +1,3 @@
+_base_ = './repvgg-A0_8xb32_in1k.py'
+
+model = dict(backbone=dict(arch='B2'), head=dict(in_channels=2560))
diff --git a/configs/repvgg/repvgg-B2g4_8xb32_in1k.py b/configs/repvgg/repvgg-B2g4_8xb32_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..8b3397881d74785870c266f1212cfee364dab38d
--- /dev/null
+++ b/configs/repvgg/repvgg-B2g4_8xb32_in1k.py
@@ -0,0 +1,3 @@
+_base_ = './repvgg-B3_8xb32_in1k.py'
+
+model = dict(backbone=dict(arch='B2g4'), head=dict(in_channels=2560))
diff --git a/configs/repvgg/repvgg-B3_8xb32_in1k.py b/configs/repvgg/repvgg-B3_8xb32_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..e9d5257838c9e2061dfbe39aa2b1456820009ff3
--- /dev/null
+++ b/configs/repvgg/repvgg-B3_8xb32_in1k.py
@@ -0,0 +1,67 @@
+_base_ = [
+ '../_base_/models/repvgg-B3_lbs-mixup_in1k.py',
+ '../_base_/datasets/imagenet_bs32_pil_resize.py',
+ '../_base_/schedules/imagenet_bs256_coslr.py',
+ '../_base_/default_runtime.py'
+]
+
+# schedule settings
+optim_wrapper = dict(
+ paramwise_cfg=dict(
+ bias_decay_mult=0.0,
+ custom_keys={
+ 'branch_3x3.norm': dict(decay_mult=0.0),
+ 'branch_1x1.norm': dict(decay_mult=0.0),
+ 'branch_norm.bias': dict(decay_mult=0.0),
+ }))
+
+data_preprocessor = dict(
+ # RGB format normalization parameters
+ mean=[123.675, 116.28, 103.53],
+ std=[58.395, 57.12, 57.375],
+ # convert image from BGR to RGB
+ to_rgb=True,
+)
+
+bgr_mean = data_preprocessor['mean'][::-1]
+bgr_std = data_preprocessor['std'][::-1]
+
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='RandomResizedCrop', scale=224, backend='pillow'),
+ dict(type='RandomFlip', prob=0.5, direction='horizontal'),
+ dict(
+ type='RandAugment',
+ policies='timm_increasing',
+ num_policies=2,
+ total_level=10,
+ magnitude_level=7,
+ magnitude_std=0.5,
+ hparams=dict(pad_val=[round(x) for x in bgr_mean])),
+ dict(type='PackInputs'),
+]
+
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='ResizeEdge', scale=256, edge='short', backend='pillow'),
+ dict(type='CenterCrop', crop_size=224),
+ dict(type='PackInputs'),
+]
+
+train_dataloader = dict(dataset=dict(pipeline=train_pipeline))
+val_dataloader = dict(dataset=dict(pipeline=test_pipeline))
+test_dataloader = dict(dataset=dict(pipeline=test_pipeline))
+
+# schedule settings
+param_scheduler = dict(
+ type='CosineAnnealingLR',
+ T_max=200,
+ by_epoch=True,
+ begin=0,
+ end=200,
+ convert_to_iter_based=True)
+
+train_cfg = dict(by_epoch=True, max_epochs=200)
+
+default_hooks = dict(
+ checkpoint=dict(type='CheckpointHook', interval=1, max_keep_ckpts=3))
diff --git a/configs/repvgg/repvgg-B3g4_8xb32_in1k.py b/configs/repvgg/repvgg-B3g4_8xb32_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..b0c5c00af845f5e4f02b44105095f78835f35096
--- /dev/null
+++ b/configs/repvgg/repvgg-B3g4_8xb32_in1k.py
@@ -0,0 +1,3 @@
+_base_ = './repvgg-B3_8xb32_in1k.py'
+
+model = dict(backbone=dict(arch='B3g4'))
diff --git a/configs/repvgg/repvgg-D2se_8xb32_in1k.py b/configs/repvgg/repvgg-D2se_8xb32_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..f532dcd79686a119e1bed528a1e7c36195e70857
--- /dev/null
+++ b/configs/repvgg/repvgg-D2se_8xb32_in1k.py
@@ -0,0 +1,28 @@
+_base_ = './repvgg-B3_8xb32_in1k.py'
+
+model = dict(backbone=dict(arch='D2se'), head=dict(in_channels=2560))
+
+param_scheduler = [
+ # warm up learning rate scheduler
+ dict(
+ type='LinearLR',
+ start_factor=0.0001,
+ by_epoch=True,
+ begin=0,
+ end=5,
+ # update by iter
+ convert_to_iter_based=True),
+ # main learning rate scheduler
+ dict(
+ type='CosineAnnealingLR',
+ T_max=295,
+ eta_min=1.0e-6,
+ by_epoch=True,
+ begin=5,
+ end=300)
+]
+
+train_cfg = dict(by_epoch=True, max_epochs=300)
+
+default_hooks = dict(
+ checkpoint=dict(type='CheckpointHook', interval=1, max_keep_ckpts=3))
diff --git a/configs/res2net/README.md b/configs/res2net/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..68b1acce79c18d994d2e310392a75a4b74db6078
--- /dev/null
+++ b/configs/res2net/README.md
@@ -0,0 +1,78 @@
+# Res2Net
+
+> [Res2Net: A New Multi-scale Backbone Architecture](https://arxiv.org/abs/1904.01169)
+
+
+
+## Abstract
+
+Representing features at multiple scales is of great importance for numerous vision tasks. Recent advances in backbone convolutional neural networks (CNNs) continually demonstrate stronger multi-scale representation ability, leading to consistent performance gains on a wide range of applications. However, most existing methods represent the multi-scale features in a layer-wise manner. In this paper, we propose a novel building block for CNNs, namely Res2Net, by constructing hierarchical residual-like connections within one single residual block. The Res2Net represents multi-scale features at a granular level and increases the range of receptive fields for each network layer. The proposed Res2Net block can be plugged into the state-of-the-art backbone CNN models, e.g., ResNet, ResNeXt, and DLA. We evaluate the Res2Net block on all these models and demonstrate consistent performance gains over baseline models on widely-used datasets, e.g., CIFAR-100 and ImageNet. Further ablation studies and experimental results on representative computer vision tasks, i.e., object detection, class activation mapping, and salient object detection, further verify the superiority of the Res2Net over the state-of-the-art baseline methods.
+
+
+

+
+
+## How to use it?
+
+
+
+**Predict image**
+
+```python
+from mmpretrain import inference_model
+
+predict = inference_model('res2net50-w14-s8_3rdparty_8xb32_in1k', 'demo/bird.JPEG')
+print(predict['pred_class'])
+print(predict['pred_score'])
+```
+
+**Use the model**
+
+```python
+import torch
+from mmpretrain import get_model
+
+model = get_model('res2net50-w14-s8_3rdparty_8xb32_in1k', pretrained=True)
+inputs = torch.rand(1, 3, 224, 224)
+out = model(inputs)
+print(type(out))
+# To extract features.
+feats = model.extract_feat(inputs)
+print(type(feats))
+```
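+
+If you need the multi-scale feature maps from intermediate stages, the snippet below is a sketch that builds the backbone directly; the constructor arguments (`depth`, `scales`, `base_width`, `out_indices`) are assumed to follow the ResNet-style backbone API:
+
+```python
+import torch
+
+from mmpretrain.models import Res2Net
+
+# w14-s8 variant: base width 14, 8 scales; output all four stages.
+backbone = Res2Net(depth=50, scales=8, base_width=14, out_indices=(0, 1, 2, 3))
+backbone.eval()
+inputs = torch.rand(1, 3, 224, 224)
+with torch.no_grad():
+    feats = backbone(inputs)
+for feat in feats:
+    # One feature map per requested stage, from stride 4 to stride 32.
+    print(feat.shape)
+```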
+
+**Test Command**
+
+Prepare your dataset according to the [docs](https://mmpretrain.readthedocs.io/en/latest/user_guides/dataset_prepare.html#prepare-dataset).
+
+Test:
+
+```shell
+python tools/test.py configs/res2net/res2net50-w14-s8_8xb32_in1k.py https://download.openmmlab.com/mmclassification/v0/res2net/res2net50-w14-s8_3rdparty_8xb32_in1k_20210927-bc967bf1.pth
+```
+
+
+
+## Models and results
+
+### Image Classification on ImageNet-1k
+
+| Model | Pretrain | Params (M) | Flops (G) | Top-1 (%) | Top-5 (%) | Config | Download |
+| :---------------------------------------- | :----------: | :--------: | :-------: | :-------: | :-------: | :---------------------------------------: | :-------------------------------------------------------------------: |
+| `res2net50-w14-s8_3rdparty_8xb32_in1k`\* | From scratch | 25.06 | 4.22 | 78.14 | 93.85 | [config](res2net50-w14-s8_8xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/res2net/res2net50-w14-s8_3rdparty_8xb32_in1k_20210927-bc967bf1.pth) |
+| `res2net50-w26-s8_3rdparty_8xb32_in1k`\* | From scratch | 48.40 | 8.39 | 79.20 | 94.36 | [config](res2net50-w26-s8_8xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/res2net/res2net50-w26-s8_3rdparty_8xb32_in1k_20210927-f547a94b.pth) |
+| `res2net101-w26-s4_3rdparty_8xb32_in1k`\* | From scratch | 45.21 | 8.12 | 79.19 | 94.44 | [config](res2net101-w26-s4_8xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/res2net/res2net101-w26-s4_3rdparty_8xb32_in1k_20210927-870b6c36.pth) |
+
+*Models with * are converted from the [official repo](https://github.com/Res2Net/Res2Net-PretrainedModels/blob/master/res2net.py#L181). The config files of these models are only for inference. We haven't reproduced the training results.*
+
+## Citation
+
+```bibtex
+@article{gao2019res2net,
+ title={Res2Net: A New Multi-scale Backbone Architecture},
+ author={Gao, Shang-Hua and Cheng, Ming-Ming and Zhao, Kai and Zhang, Xin-Yu and Yang, Ming-Hsuan and Torr, Philip},
+ journal={IEEE TPAMI},
+ year={2021},
+ doi={10.1109/TPAMI.2019.2938758},
+}
+```
diff --git a/configs/res2net/metafile.yml b/configs/res2net/metafile.yml
new file mode 100644
index 0000000000000000000000000000000000000000..b19b102f443998a335362a43b0deb57e0bc264a5
--- /dev/null
+++ b/configs/res2net/metafile.yml
@@ -0,0 +1,70 @@
+Collections:
+ - Name: Res2Net
+ Metadata:
+ Training Data: ImageNet-1k
+ Training Techniques:
+ - SGD with Momentum
+ - Weight Decay
+ Architecture:
+ - Batch Normalization
+ - Convolution
+ - Global Average Pooling
+ - ReLU
+ - Res2Net Block
+ Paper:
+ Title: 'Res2Net: A New Multi-scale Backbone Architecture'
+ URL: https://arxiv.org/abs/1904.01169
+ README: configs/res2net/README.md
+ Code:
+ URL: https://github.com/open-mmlab/mmpretrain/blob/v0.17.0/mmcls/models/backbones/res2net.py
+ Version: v0.17.0
+
+Models:
+ - Name: res2net50-w14-s8_3rdparty_8xb32_in1k
+ Metadata:
+ FLOPs: 4220000000
+ Parameters: 25060000
+ In Collection: Res2Net
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 78.14
+ Top 5 Accuracy: 93.85
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/res2net/res2net50-w14-s8_3rdparty_8xb32_in1k_20210927-bc967bf1.pth
+ Converted From:
+ Weights: https://1drv.ms/u/s!AkxDDnOtroRPdOTqhF8ne_aakDI?e=EVb8Ri
+ Code: https://github.com/Res2Net/Res2Net-PretrainedModels/blob/master/res2net.py#L221
+ Config: configs/res2net/res2net50-w14-s8_8xb32_in1k.py
+ - Name: res2net50-w26-s8_3rdparty_8xb32_in1k
+ Metadata:
+ FLOPs: 8390000000
+ Parameters: 48400000
+ In Collection: Res2Net
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 79.20
+ Top 5 Accuracy: 94.36
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/res2net/res2net50-w26-s8_3rdparty_8xb32_in1k_20210927-f547a94b.pth
+ Converted From:
+ Weights: https://1drv.ms/u/s!AkxDDnOtroRPdTrAd_Afzc26Z7Q?e=slYqsR
+ Code: https://github.com/Res2Net/Res2Net-PretrainedModels/blob/master/res2net.py#L201
+ Config: configs/res2net/res2net50-w26-s8_8xb32_in1k.py
+ - Name: res2net101-w26-s4_3rdparty_8xb32_in1k
+ Metadata:
+ FLOPs: 8120000000
+ Parameters: 45210000
+ In Collection: Res2Net
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 79.19
+ Top 5 Accuracy: 94.44
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/res2net/res2net101-w26-s4_3rdparty_8xb32_in1k_20210927-870b6c36.pth
+ Converted From:
+ Weights: https://1drv.ms/u/s!AkxDDnOtroRPcJRgTLkahL0cFYw?e=nwbnic
+ Code: https://github.com/Res2Net/Res2Net-PretrainedModels/blob/master/res2net.py#L181
+ Config: configs/res2net/res2net101-w26-s4_8xb32_in1k.py
diff --git a/configs/res2net/res2net101-w26-s4_8xb32_in1k.py b/configs/res2net/res2net101-w26-s4_8xb32_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..7ebe9e94d64a305a06dda71c3c20d8c6c77cfc06
--- /dev/null
+++ b/configs/res2net/res2net101-w26-s4_8xb32_in1k.py
@@ -0,0 +1,5 @@
+_base_ = [
+ '../_base_/models/res2net101-w26-s4.py',
+ '../_base_/datasets/imagenet_bs32_pil_resize.py',
+ '../_base_/schedules/imagenet_bs256.py', '../_base_/default_runtime.py'
+]
diff --git a/configs/res2net/res2net50-w14-s8_8xb32_in1k.py b/configs/res2net/res2net50-w14-s8_8xb32_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..56cc02e3b893e4976940badabfa577db471620bc
--- /dev/null
+++ b/configs/res2net/res2net50-w14-s8_8xb32_in1k.py
@@ -0,0 +1,5 @@
+_base_ = [
+ '../_base_/models/res2net50-w14-s8.py',
+ '../_base_/datasets/imagenet_bs32_pil_resize.py',
+ '../_base_/schedules/imagenet_bs256.py', '../_base_/default_runtime.py'
+]
diff --git a/configs/res2net/res2net50-w26-s8_8xb32_in1k.py b/configs/res2net/res2net50-w26-s8_8xb32_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..d7dcbeb9164875b21aa782ac5bed5f4618a4363e
--- /dev/null
+++ b/configs/res2net/res2net50-w26-s8_8xb32_in1k.py
@@ -0,0 +1,5 @@
+_base_ = [
+ '../_base_/models/res2net50-w26-s8.py',
+ '../_base_/datasets/imagenet_bs32_pil_resize.py',
+ '../_base_/schedules/imagenet_bs256.py', '../_base_/default_runtime.py'
+]
diff --git a/configs/resnest/README.md b/configs/resnest/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..eb6c5fd728c3032b6b8c429100f1399b8803b765
--- /dev/null
+++ b/configs/resnest/README.md
@@ -0,0 +1,26 @@
+# ResNeSt
+
+> [ResNeSt: Split-Attention Networks](https://arxiv.org/abs/2004.08955)
+
+
+
+## Abstract
+
+It is well known that featuremap attention and multi-path representation are important for visual recognition. In this paper, we present a modularized architecture, which applies the channel-wise attention on different network branches to leverage their success in capturing cross-feature interactions and learning diverse representations. Our design results in a simple and unified computation block, which can be parameterized using only a few variables. Our model, named ResNeSt, outperforms EfficientNet in accuracy and latency trade-off on image classification. In addition, ResNeSt has achieved superior transfer learning results on several public benchmarks serving as the backbone, and has been adopted by the winning entries of COCO-LVIS challenge. The source code for complete system and pretrained models are publicly available.
+
+
+

+
+
+## Citation
+
+```bibtex
+@misc{zhang2020resnest,
+ title={ResNeSt: Split-Attention Networks},
+ author={Hang Zhang and Chongruo Wu and Zhongyue Zhang and Yi Zhu and Haibin Lin and Zhi Zhang and Yue Sun and Tong He and Jonas Mueller and R. Manmatha and Mu Li and Alexander Smola},
+ year={2020},
+ eprint={2004.08955},
+ archivePrefix={arXiv},
+ primaryClass={cs.CV}
+}
+```
diff --git a/configs/resnest/_randaug_policies.py b/configs/resnest/_randaug_policies.py
new file mode 100644
index 0000000000000000000000000000000000000000..d650caa2f586045ab76102a5506885e6da2fb4ed
--- /dev/null
+++ b/configs/resnest/_randaug_policies.py
@@ -0,0 +1,92 @@
+policies = [
+ dict(type='AutoContrast', prob=0.5),
+ dict(type='Equalize', prob=0.5),
+ dict(type='Invert', prob=0.5),
+ dict(
+ type='Rotate',
+ magnitude_key='angle',
+ magnitude_range=(0, 30),
+ pad_val=0,
+ prob=0.5,
+ random_negative_prob=0.5),
+ dict(
+ type='Posterize',
+ magnitude_key='bits',
+ magnitude_range=(0, 4),
+ prob=0.5),
+ dict(
+ type='Solarize',
+ magnitude_key='thr',
+ magnitude_range=(0, 256),
+ prob=0.5),
+ dict(
+ type='SolarizeAdd',
+ magnitude_key='magnitude',
+ magnitude_range=(0, 110),
+ thr=128,
+ prob=0.5),
+ dict(
+ type='ColorTransform',
+ magnitude_key='magnitude',
+ magnitude_range=(-0.9, 0.9),
+ prob=0.5,
+ random_negative_prob=0.),
+ dict(
+ type='Contrast',
+ magnitude_key='magnitude',
+ magnitude_range=(-0.9, 0.9),
+ prob=0.5,
+ random_negative_prob=0.),
+ dict(
+ type='Brightness',
+ magnitude_key='magnitude',
+ magnitude_range=(-0.9, 0.9),
+ prob=0.5,
+ random_negative_prob=0.),
+ dict(
+ type='Sharpness',
+ magnitude_key='magnitude',
+ magnitude_range=(-0.9, 0.9),
+ prob=0.5,
+ random_negative_prob=0.),
+ dict(
+ type='Shear',
+ magnitude_key='magnitude',
+ magnitude_range=(0, 0.3),
+ pad_val=0,
+ prob=0.5,
+ direction='horizontal',
+ random_negative_prob=0.5),
+ dict(
+ type='Shear',
+ magnitude_key='magnitude',
+ magnitude_range=(0, 0.3),
+ pad_val=0,
+ prob=0.5,
+ direction='vertical',
+ random_negative_prob=0.5),
+ dict(
+ type='Cutout',
+ magnitude_key='shape',
+ magnitude_range=(1, 41),
+ pad_val=0,
+ prob=0.5),
+ dict(
+ type='Translate',
+ magnitude_key='magnitude',
+ magnitude_range=(0, 0.3),
+ pad_val=0,
+ prob=0.5,
+ direction='horizontal',
+ random_negative_prob=0.5,
+ interpolation='bicubic'),
+ dict(
+ type='Translate',
+ magnitude_key='magnitude',
+ magnitude_range=(0, 0.3),
+ pad_val=0,
+ prob=0.5,
+ direction='vertical',
+ random_negative_prob=0.5,
+ interpolation='bicubic')
+]
diff --git a/configs/resnest/resnest101_32xb64_in1k.py b/configs/resnest/resnest101_32xb64_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..ac78659147a6fd1a56a89f56ed552ef3736488c4
--- /dev/null
+++ b/configs/resnest/resnest101_32xb64_in1k.py
@@ -0,0 +1,78 @@
+_base_ = [
+ '../_base_/models/resnest101.py',
+ '../_base_/datasets/imagenet_bs64.py',
+ '../_base_/default_runtime.py',
+ './_randaug_policies.py',
+]
+
+# dataset settings
+
+# lighting params, in order of BGR
+EIGVAL = [55.4625, 4.7940, 1.1475]
+EIGVEC = [
+ [-0.5836, -0.6948, 0.4203],
+ [-0.5808, -0.0045, -0.8140],
+ [-0.5675, 0.7192, 0.4009],
+]
+
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='RandAugment',
+ policies={{_base_.policies}},
+ num_policies=2,
+ magnitude_level=12),
+ dict(type='EfficientNetRandomCrop', scale=256, backend='pillow'),
+ dict(type='RandomFlip', prob=0.5, direction='horizontal'),
+ dict(type='ColorJitter', brightness=0.4, contrast=0.4, saturation=0.4),
+ dict(
+ type='Lighting',
+ eigval=EIGVAL,
+ eigvec=EIGVEC,
+ alphastd=0.1,
+ to_rgb=False),
+ dict(type='PackInputs'),
+]
+
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='EfficientNetCenterCrop', crop_size=256, backend='pillow'),
+ dict(type='PackInputs'),
+]
+
+train_dataloader = dict(dataset=dict(pipeline=train_pipeline))
+val_dataloader = dict(dataset=dict(pipeline=test_pipeline))
+test_dataloader = dict(dataset=dict(pipeline=test_pipeline))
+
+# schedule settings
+optim_wrapper = dict(
+ optimizer=dict(type='SGD', lr=0.8, momentum=0.9, weight_decay=1e-4),
+ paramwise_cfg=dict(bias_decay_mult=0., norm_decay_mult=0.),
+)
+
+param_scheduler = [
+ # warm up learning rate scheduler
+ dict(
+ type='LinearLR',
+ start_factor=1e-6,
+ by_epoch=True,
+ begin=0,
+ end=5,
+ # update by iter
+ convert_to_iter_based=True),
+ # main learning rate scheduler
+ dict(
+ type='CosineAnnealingLR',
+ T_max=265,
+ by_epoch=True,
+ begin=5,
+ end=270,
+ )
+]
+
+train_cfg = dict(by_epoch=True, max_epochs=270)
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR
+# based on the actual training batch size.
+# base_batch_size = (32 GPUs) x (64 samples per GPU)
+auto_scale_lr = dict(base_batch_size=2048)
diff --git a/configs/resnest/resnest200_64xb32_in1k.py b/configs/resnest/resnest200_64xb32_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..e3b9fb3d7dad8357829a820286f27ef0097426b6
--- /dev/null
+++ b/configs/resnest/resnest200_64xb32_in1k.py
@@ -0,0 +1,74 @@
+_base_ = [
+ '../_base_/models/resnest200.py',
+ '../_base_/datasets/imagenet_bs32.py',
+ '../_base_/default_runtime.py',
+ './_randaug_policies.py',
+]
+
+# dataset settings
+
+# lighting params, in order of BGR
+EIGVAL = [55.4625, 4.7940, 1.1475]
+EIGVEC = [
+ [-0.5836, -0.6948, 0.4203],
+ [-0.5808, -0.0045, -0.8140],
+ [-0.5675, 0.7192, 0.4009],
+]
+
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='RandAugment',
+ policies={{_base_.policies}},
+ num_policies=2,
+ magnitude_level=12),
+ dict(type='EfficientNetRandomCrop', scale=320, backend='pillow'),
+ dict(type='RandomFlip', prob=0.5, direction='horizontal'),
+ dict(type='ColorJitter', brightness=0.4, contrast=0.4, saturation=0.4),
+ dict(
+ type='Lighting',
+ eigval=EIGVAL,
+ eigvec=EIGVEC,
+ alphastd=0.1,
+ to_rgb=False),
+ dict(type='PackInputs'),
+]
+
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='EfficientNetCenterCrop', crop_size=320, backend='pillow'),
+ dict(type='PackInputs'),
+]
+
+train_dataloader = dict(dataset=dict(pipeline=train_pipeline))
+val_dataloader = dict(dataset=dict(pipeline=test_pipeline))
+test_dataloader = dict(dataset=dict(pipeline=test_pipeline))
+
+# schedule settings
+optim_wrapper = dict(
+ optimizer=dict(type='SGD', lr=0.8, momentum=0.9, weight_decay=1e-4),
+ paramwise_cfg=dict(bias_decay_mult=0., norm_decay_mult=0.),
+)
+
+param_scheduler = [
+ # warm up learning rate scheduler
+ dict(
+ type='LinearLR',
+ start_factor=1e-6,
+ by_epoch=True,
+ begin=0,
+ end=5,
+ # update by iter
+ convert_to_iter_based=True),
+ # main learning rate scheduler
+ dict(
+ type='CosineAnnealingLR',
+ T_max=265,
+ by_epoch=True,
+ begin=5,
+ end=270,
+ )
+]
+
+train_cfg = dict(by_epoch=True, max_epochs=270)
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR
+# based on the actual training batch size.
+# base_batch_size = (64 GPUs) x (32 samples per GPU)
+auto_scale_lr = dict(base_batch_size=2048)
diff --git a/configs/resnest/resnest269_64xb32_in1k.py b/configs/resnest/resnest269_64xb32_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..0e884d63586f8210143ca0bf1e9cf33b2449a4f9
--- /dev/null
+++ b/configs/resnest/resnest269_64xb32_in1k.py
@@ -0,0 +1,78 @@
+_base_ = [
+ '../_base_/models/resnest269.py',
+ '../_base_/datasets/imagenet_bs32.py',
+ '../_base_/default_runtime.py',
+ './_randaug_policies.py',
+]
+
+# dataset settings
+
+# lighting params, in order of BGR
+EIGVAL = [55.4625, 4.7940, 1.1475]
+EIGVEC = [
+ [-0.5836, -0.6948, 0.4203],
+ [-0.5808, -0.0045, -0.8140],
+ [-0.5675, 0.7192, 0.4009],
+]
+
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='RandAugment',
+ policies={{_base_.policies}},
+ num_policies=2,
+ magnitude_level=12),
+ dict(type='EfficientNetRandomCrop', scale=416, backend='pillow'),
+ dict(type='RandomFlip', prob=0.5, direction='horizontal'),
+ dict(type='ColorJitter', brightness=0.4, contrast=0.4, saturation=0.4),
+ dict(
+ type='Lighting',
+ eigval=EIGVAL,
+ eigvec=EIGVEC,
+ alphastd=0.1,
+ to_rgb=False),
+ dict(type='PackInputs'),
+]
+
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='EfficientNetCenterCrop', crop_size=416, backend='pillow'),
+ dict(type='PackInputs'),
+]
+
+train_dataloader = dict(dataset=dict(pipeline=train_pipeline))
+val_dataloader = dict(dataset=dict(pipeline=test_pipeline))
+test_dataloader = dict(dataset=dict(pipeline=test_pipeline))
+
+# schedule settings
+optim_wrapper = dict(
+ optimizer=dict(type='SGD', lr=0.8, momentum=0.9, weight_decay=1e-4),
+ paramwise_cfg=dict(bias_decay_mult=0., norm_decay_mult=0.),
+)
+
+param_scheduler = [
+ # warm up learning rate scheduler
+ dict(
+ type='LinearLR',
+ start_factor=1e-6,
+ by_epoch=True,
+ begin=0,
+ end=5,
+ # update by iter
+ convert_to_iter_based=True),
+ # main learning rate scheduler
+ dict(
+ type='CosineAnnealingLR',
+ T_max=265,
+ by_epoch=True,
+ begin=5,
+ end=270,
+ )
+]
+
+train_cfg = dict(by_epoch=True, max_epochs=270)
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR
+# based on the actual training batch size.
+# base_batch_size = (64 GPUs) x (32 samples per GPU)
+auto_scale_lr = dict(base_batch_size=2048)
diff --git a/configs/resnest/resnest50_32xb64_in1k.py b/configs/resnest/resnest50_32xb64_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..05f839b38b669a3093a8a7df7f78f135b88e6b77
--- /dev/null
+++ b/configs/resnest/resnest50_32xb64_in1k.py
@@ -0,0 +1,78 @@
+_base_ = [
+ '../_base_/models/resnest50.py',
+ '../_base_/datasets/imagenet_bs64.py',
+ '../_base_/default_runtime.py',
+ './_randaug_policies.py',
+]
+
+# dataset settings
+
+# lighting params, in order of BGR
+EIGVAL = [55.4625, 4.7940, 1.1475]
+EIGVEC = [
+ [-0.5836, -0.6948, 0.4203],
+ [-0.5808, -0.0045, -0.8140],
+ [-0.5675, 0.7192, 0.4009],
+]
+
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='RandAugment',
+ policies={{_base_.policies}},
+ num_policies=2,
+ magnitude_level=12),
+ dict(type='EfficientNetRandomCrop', scale=224, backend='pillow'),
+ dict(type='RandomFlip', prob=0.5, direction='horizontal'),
+ dict(type='ColorJitter', brightness=0.4, contrast=0.4, saturation=0.4),
+ dict(
+ type='Lighting',
+ eigval=EIGVAL,
+ eigvec=EIGVEC,
+ alphastd=0.1,
+ to_rgb=False),
+ dict(type='PackInputs'),
+]
+
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='EfficientNetCenterCrop', crop_size=256, backend='pillow'),
+ dict(type='PackInputs'),
+]
+
+train_dataloader = dict(dataset=dict(pipeline=train_pipeline))
+val_dataloader = dict(dataset=dict(pipeline=test_pipeline))
+test_dataloader = dict(dataset=dict(pipeline=test_pipeline))
+
+# schedule settings
+optim_wrapper = dict(
+ optimizer=dict(type='SGD', lr=0.8, momentum=0.9, weight_decay=1e-4),
+ paramwise_cfg=dict(bias_decay_mult=0., norm_decay_mult=0.),
+)
+
+param_scheduler = [
+ # warm up learning rate scheduler
+ dict(
+ type='LinearLR',
+ start_factor=1e-6,
+ by_epoch=True,
+ begin=0,
+ end=5,
+ # update by iter
+ convert_to_iter_based=True),
+ # main learning rate scheduler
+ dict(
+ type='CosineAnnealingLR',
+ T_max=265,
+ by_epoch=True,
+ begin=5,
+ end=270,
+ )
+]
+
+train_cfg = dict(by_epoch=True, max_epochs=270)
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR
+# based on the actual training batch size.
+# base_batch_size = (32 GPUs) x (64 samples per GPU)
+auto_scale_lr = dict(base_batch_size=2048)
diff --git a/configs/resnet/README.md b/configs/resnet/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..286b77381a57401607cc52568d1d81b8ba5b4d83
--- /dev/null
+++ b/configs/resnet/README.md
@@ -0,0 +1,140 @@
+# ResNet
+
+> [Deep Residual Learning for Image Recognition](https://openaccess.thecvf.com/content_cvpr_2016/html/He_Deep_Residual_Learning_CVPR_2016_paper.html)
+
+
+
+## Introduction
+
+**Residual Networks**, or **ResNets**, learn residual functions with reference to the layer inputs, instead of
+learning unreferenced functions. In earlier mainstream architectures such as VGG, the network is a plain stack
+of layers and every layer attempts to fit a desired underlying mapping. In ResNets, a few stacked layers are
+grouped into a block, and the layers in a block attempt to learn a residual mapping.
+
+Formally, denoting the desired underlying mapping of a block as $\mathcal{H}(x)$, the mapping is split
+into the sum of the identity and a residual mapping, $\mathcal{H}(x) = x + \mathcal{F}(x)$, and the
+stacked non-linear layers are left to fit the residual mapping $\mathcal{F}(x)$.
+
+Many works have shown that this method makes deep neural networks easier to optimize and lets them gain accuracy
+from considerably increased depth. The residual structure is now widely used in a variety of models.
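+
+As a minimal sketch of this formulation, a residual block can be written in a few lines of PyTorch. This is
+only an illustration of $\mathcal{H}(x) = x + \mathcal{F}(x)$ with a placeholder channel size, not the actual
+block implementation used by the backbones in this repository (which also handles downsampling and channel
+expansion):
+
+```python
+import torch
+import torch.nn as nn
+
+
+class ResidualBlock(nn.Module):
+    """Toy block computing H(x) = x + F(x)."""
+
+    def __init__(self, channels):
+        super().__init__()
+        # F(x): the stacked non-linear layers that learn the residual mapping.
+        self.residual = nn.Sequential(
+            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
+            nn.BatchNorm2d(channels),
+            nn.ReLU(inplace=True),
+            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
+            nn.BatchNorm2d(channels),
+        )
+
+    def forward(self, x):
+        # H(x) = x + F(x): identity shortcut plus the learned residual.
+        return torch.relu(x + self.residual(x))
+
+
+block = ResidualBlock(64)
+out = block(torch.rand(1, 64, 56, 56))  # output has the same shape as the input
+```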
+
+
+

+
+
+## Abstract
+
+
+
+
+
+Deeper neural networks are more difficult to train. We present a residual learning framework to ease the training of networks that are substantially deeper than those used previously. We explicitly reformulate the layers as learning residual functions with reference to the layer inputs, instead of learning unreferenced functions. We provide comprehensive empirical evidence showing that these residual networks are easier to optimize, and can gain accuracy from considerably increased depth. On the ImageNet dataset we evaluate residual nets with a depth of up to 152 layers---8x deeper than VGG nets but still having lower complexity. An ensemble of these residual nets achieves 3.57% error on the ImageNet test set. This result won the 1st place on the ILSVRC 2015 classification task. We also present analysis on CIFAR-10 with 100 and 1000 layers.
+
+The depth of representations is of central importance for many visual recognition tasks. Solely due to our extremely deep representations, we obtain a 28% relative improvement on the COCO object detection dataset. Deep residual nets are foundations of our submissions to ILSVRC & COCO 2015 competitions, where we also won the 1st places on the tasks of ImageNet detection, ImageNet localization, COCO detection, and COCO segmentation.
+
+
+
+
+## How to use it?
+
+
+
+**Predict image**
+
+```python
+from mmpretrain import inference_model
+
+predict = inference_model('resnet18_8xb16_cifar10', 'demo/bird.JPEG')
+print(predict['pred_class'])
+print(predict['pred_score'])
+```
+
+**Use the model**
+
+```python
+import torch
+from mmpretrain import get_model
+
+model = get_model('resnet18_8xb16_cifar10', pretrained=True)
+inputs = torch.rand(1, 3, 224, 224)
+out = model(inputs)
+print(type(out))
+# To extract features.
+feats = model.extract_feat(inputs)
+print(type(feats))
+```
+
+**Train/Test Command**
+
+Prepare your dataset according to the [docs](https://mmpretrain.readthedocs.io/en/latest/user_guides/dataset_prepare.html#prepare-dataset).
+
+Train:
+
+```shell
+python tools/train.py configs/resnet/resnet18_8xb16_cifar10.py
+```
+
+Test:
+
+```shell
+python tools/test.py configs/resnet/resnet18_8xb16_cifar10.py https://download.openmmlab.com/mmclassification/v0/resnet/resnet18_b16x8_cifar10_20210528-bd6371c8.pth
+```
+
+
+
+## Models and results
+
+### Image Classification on ImageNet-1k
+
+| Model | Pretrain | Params (M) | Flops (G) | Top-1 (%) | Top-5 (%) | Config | Download |
+| :--------------------------------- | :----------: | :--------: | :-------: | :-------: | :-------: | :-------------------------------------------: | :----------------------------------------------------------------------: |
+| `resnet18_8xb32_in1k` | From scratch | 11.69 | 1.82 | 69.90 | 89.43 | [config](resnet18_8xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/resnet/resnet18_8xb32_in1k_20210831-fbbb1da6.pth) \| [log](https://download.openmmlab.com/mmclassification/v0/resnet/resnet18_8xb32_in1k_20210831-fbbb1da6.json) |
+| `resnet34_8xb32_in1k`               | From scratch | 21.80      | 3.68      | 73.62     | 91.59     | [config](resnet34_8xb32_in1k.py)               | [model](https://download.openmmlab.com/mmclassification/v0/resnet/resnet34_8xb32_in1k_20210831-f257d4e6.pth) \| [log](https://download.openmmlab.com/mmclassification/v0/resnet/resnet34_8xb32_in1k_20210831-f257d4e6.json) |
+| `resnet50_8xb32_in1k` | From scratch | 25.56 | 4.12 | 76.55 | 93.06 | [config](resnet50_8xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/resnet/resnet50_8xb32_in1k_20210831-ea4938fc.pth) \| [log](https://download.openmmlab.com/mmclassification/v0/resnet/resnet50_8xb32_in1k_20210831-ea4938fc.json) |
+| `resnet101_8xb32_in1k` | From scratch | 44.55 | 7.85 | 77.97 | 94.06 | [config](resnet101_8xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/resnet/resnet101_8xb32_in1k_20210831-539c63f8.pth) \| [log](https://download.openmmlab.com/mmclassification/v0/resnet/resnet101_8xb32_in1k_20210831-539c63f8.json) |
+| `resnet152_8xb32_in1k` | From scratch | 60.19 | 11.58 | 78.48 | 94.13 | [config](resnet152_8xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/resnet/resnet152_8xb32_in1k_20210901-4d7582fa.pth) \| [log](https://download.openmmlab.com/mmclassification/v0/resnet/resnet152_8xb32_in1k_20210901-4d7582fa.json) |
+| `resnetv1d50_8xb32_in1k` | From scratch | 25.58 | 4.36 | 77.54 | 93.57 | [config](resnetv1d50_8xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/resnet/resnetv1d50_b32x8_imagenet_20210531-db14775a.pth) \| [log](https://download.openmmlab.com/mmclassification/v0/resnet/resnetv1d50_b32x8_imagenet_20210531-db14775a.json) |
+| `resnetv1d101_8xb32_in1k` | From scratch | 44.57 | 8.09 | 78.93 | 94.48 | [config](resnetv1d101_8xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/resnet/resnetv1d101_b32x8_imagenet_20210531-6e13bcd3.pth) \| [log](https://download.openmmlab.com/mmclassification/v0/resnet/resnetv1d101_b32x8_imagenet_20210531-6e13bcd3.json) |
+| `resnetv1d152_8xb32_in1k` | From scratch | 60.21 | 11.82 | 79.41 | 94.70 | [config](resnetv1d152_8xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/resnet/resnetv1d152_b32x8_imagenet_20210531-278cf22a.pth) \| [log](https://download.openmmlab.com/mmclassification/v0/resnet/resnetv1d152_b32x8_imagenet_20210531-278cf22a.json) |
+| `resnet50_8xb32-fp16_in1k` | From scratch | 25.56 | 4.12 | 76.30 | 93.07 | [config](resnet50_8xb32-fp16_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/fp16/resnet50_batch256_fp16_imagenet_20210320-b3964210.pth) \| [log](https://download.openmmlab.com/mmclassification/v0/fp16/resnet50_batch256_fp16_imagenet_20210320-b3964210.json) |
+| `resnet50_8xb256-rsb-a1-600e_in1k` | From scratch | 25.56 | 4.12 | 80.12 | 94.78 | [config](resnet50_8xb256-rsb-a1-600e_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/resnet/resnet50_8xb256-rsb-a1-600e_in1k_20211228-20e21305.pth) \| [log](https://download.openmmlab.com/mmclassification/v0/resnet/resnet50_8xb256-rsb-a1-600e_in1k_20211228-20e21305.json) |
+| `resnet50_8xb256-rsb-a2-300e_in1k` | From scratch | 25.56 | 4.12 | 79.55 | 94.37 | [config](resnet50_8xb256-rsb-a2-300e_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/resnet/resnet50_8xb256-rsb-a2-300e_in1k_20211228-0fd8be6e.pth) \| [log](https://download.openmmlab.com/mmclassification/v0/resnet/resnet50_8xb256-rsb-a2-300e_in1k_20211228-0fd8be6e.json) |
+| `resnet50_8xb256-rsb-a3-100e_in1k` | From scratch | 25.56 | 4.12 | 78.30 | 93.80 | [config](resnet50_8xb256-rsb-a3-100e_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/resnet/resnet50_8xb256-rsb-a3-100e_in1k_20211228-3493673c.pth) \| [log](https://download.openmmlab.com/mmclassification/v0/resnet/resnet50_8xb256-rsb-a3-100e_in1k_20211228-3493673c.json) |
+| `resnetv1c50_8xb32_in1k` | From scratch | 25.58 | 4.36 | 77.01 | 93.58 | [config](resnetv1c50_8xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/resnet/resnetv1c50_8xb32_in1k_20220214-3343eccd.pth) \| [log](https://download.openmmlab.com/mmclassification/v0/resnet/resnetv1c50_8xb32_in1k_20220214-3343eccd.json) |
+| `resnetv1c101_8xb32_in1k` | From scratch | 44.57 | 8.09 | 78.30 | 94.27 | [config](resnetv1c101_8xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/resnet/resnetv1c101_8xb32_in1k_20220214-434fe45f.pth) \| [log](https://download.openmmlab.com/mmclassification/v0/resnet/resnetv1c101_8xb32_in1k_20220214-434fe45f.json) |
+| `resnetv1c152_8xb32_in1k` | From scratch | 60.21 | 11.82 | 78.76 | 94.41 | [config](resnetv1c152_8xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/resnet/resnetv1c152_8xb32_in1k_20220214-c013291f.pth) \| [log](https://download.openmmlab.com/mmclassification/v0/resnet/resnetv1c152_8xb32_in1k_20220214-c013291f.json) |
+
+### Image Classification on CIFAR-10
+
+| Model | Pretrain | Params (M) | Flops (G) | Top-1 (%) | Config | Download |
+| :------------------------ | :----------: | :--------: | :-------: | :-------: | :----------------------------------: | :-------------------------------------------------------------------------------------------------: |
+| `resnet18_8xb16_cifar10` | From scratch | 11.17 | 0.56 | 94.82 | [config](resnet18_8xb16_cifar10.py) | [model](https://download.openmmlab.com/mmclassification/v0/resnet/resnet18_b16x8_cifar10_20210528-bd6371c8.pth) \| [log](https://download.openmmlab.com/mmclassification/v0/resnet/resnet18_b16x8_cifar10_20210528-bd6371c8.json) |
+| `resnet34_8xb16_cifar10` | From scratch | 21.28 | 1.16 | 95.34 | [config](resnet34_8xb16_cifar10.py) | [model](https://download.openmmlab.com/mmclassification/v0/resnet/resnet34_b16x8_cifar10_20210528-a8aa36a6.pth) \| [log](https://download.openmmlab.com/mmclassification/v0/resnet/resnet34_b16x8_cifar10_20210528-a8aa36a6.json) |
+| `resnet50_8xb16_cifar10` | From scratch | 23.52 | 1.31 | 95.55 | [config](resnet50_8xb16_cifar10.py) | [model](https://download.openmmlab.com/mmclassification/v0/resnet/resnet50_b16x8_cifar10_20210528-f54bfad9.pth) \| [log](https://download.openmmlab.com/mmclassification/v0/resnet/resnet50_b16x8_cifar10_20210528-f54bfad9.json) |
+| `resnet101_8xb16_cifar10` | From scratch | 42.51 | 2.52 | 95.58 | [config](resnet101_8xb16_cifar10.py) | [model](https://download.openmmlab.com/mmclassification/v0/resnet/resnet101_b16x8_cifar10_20210528-2d29e936.pth) \| [log](https://download.openmmlab.com/mmclassification/v0/resnet/resnet101_b16x8_cifar10_20210528-2d29e936.json) |
+| `resnet152_8xb16_cifar10` | From scratch | 58.16 | 3.74 | 95.76 | [config](resnet152_8xb16_cifar10.py) | [model](https://download.openmmlab.com/mmclassification/v0/resnet/resnet152_b16x8_cifar10_20210528-3e8e9178.pth) \| [log](https://download.openmmlab.com/mmclassification/v0/resnet/resnet152_b16x8_cifar10_20210528-3e8e9178.json) |
+
+### Image Classification on CIFAR-100
+
+| Model | Pretrain | Params (M) | Flops (G) | Top-1 (%) | Top-5 (%) | Config | Download |
+| :------------------------ | :----------: | :--------: | :-------: | :-------: | :-------: | :----------------------------------: | :----------------------------------------------------------------------------------------: |
+| `resnet50_8xb16_cifar100` | From scratch | 23.71 | 1.31 | 79.90 | 95.19 | [config](resnet50_8xb16_cifar100.py) | [model](https://download.openmmlab.com/mmclassification/v0/resnet/resnet50_b16x8_cifar100_20210528-67b58a1b.pth) \| [log](https://download.openmmlab.com/mmclassification/v0/resnet/resnet50_b16x8_cifar100_20210528-67b58a1b.json) |
+
+### Image Classification on CUB-200-2011
+
+| Model | Pretrain | Params (M) | Flops (G) | Top-1 (%) | Config | Download |
+| :------------------ | :----------: | :--------: | :-------: | :-------: | :----------------------------: | :-------------------------------------------------------------------------------------------------------------: |
+| `resnet50_8xb8_cub` | From scratch | 23.92 | 16.48 | 88.45 | [config](resnet50_8xb8_cub.py) | [model](https://download.openmmlab.com/mmclassification/v0/resnet/resnet50_8xb8_cub_20220307-57840e60.pth) \| [log](https://download.openmmlab.com/mmclassification/v0/resnet/resnet50_8xb8_cub_20220307-57840e60.json) |
+
+## Citation
+
+```bibtex
+@inproceedings{he2016deep,
+ title={Deep residual learning for image recognition},
+ author={He, Kaiming and Zhang, Xiangyu and Ren, Shaoqing and Sun, Jian},
+ booktitle={Proceedings of the IEEE conference on computer vision and pattern recognition},
+ pages={770--778},
+ year={2016}
+}
+```
diff --git a/configs/resnet/metafile.yml b/configs/resnet/metafile.yml
new file mode 100644
index 0000000000000000000000000000000000000000..16387248c43aea59c5563b4c6c98df8dd8effead
--- /dev/null
+++ b/configs/resnet/metafile.yml
@@ -0,0 +1,352 @@
+Collections:
+ - Name: ResNet
+ Metadata:
+ Training Data: ImageNet-1k
+ Training Techniques:
+ - SGD with Momentum
+ - Weight Decay
+ Training Resources: 8x V100 GPUs
+ Epochs: 100
+ Batch Size: 256
+ Architecture:
+ - ResNet
+ Paper:
+ URL: https://openaccess.thecvf.com/content_cvpr_2016/html/He_Deep_Residual_Learning_CVPR_2016_paper.html
+ Title: "Deep Residual Learning for Image Recognition"
+ README: configs/resnet/README.md
+ Code:
+ URL: https://github.com/open-mmlab/mmpretrain/blob/v0.15.0/mmcls/models/backbones/resnet.py#L383
+ Version: v0.15.0
+
+Models:
+ - Name: resnet18_8xb16_cifar10
+ Metadata:
+ Training Data: CIFAR-10
+ Epochs: 200
+ Batch Size: 128
+ FLOPs: 560000000
+ Parameters: 11170000
+ In Collection: ResNet
+ Results:
+ - Dataset: CIFAR-10
+ Metrics:
+ Top 1 Accuracy: 94.82
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/resnet/resnet18_b16x8_cifar10_20210528-bd6371c8.pth
+ Config: configs/resnet/resnet18_8xb16_cifar10.py
+ - Name: resnet34_8xb16_cifar10
+ Metadata:
+ Training Data: CIFAR-10
+ Epochs: 200
+ Batch Size: 128
+ FLOPs: 1160000000
+ Parameters: 21280000
+ In Collection: ResNet
+ Results:
+ - Dataset: CIFAR-10
+ Metrics:
+ Top 1 Accuracy: 95.34
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/resnet/resnet34_b16x8_cifar10_20210528-a8aa36a6.pth
+ Config: configs/resnet/resnet34_8xb16_cifar10.py
+ - Name: resnet50_8xb16_cifar10
+ Metadata:
+ Training Data: CIFAR-10
+ Epochs: 200
+ Batch Size: 128
+ FLOPs: 1310000000
+ Parameters: 23520000
+ In Collection: ResNet
+ Results:
+ - Dataset: CIFAR-10
+ Metrics:
+ Top 1 Accuracy: 95.55
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/resnet/resnet50_b16x8_cifar10_20210528-f54bfad9.pth
+ Config: configs/resnet/resnet50_8xb16_cifar10.py
+ - Name: resnet101_8xb16_cifar10
+ Metadata:
+ Training Data: CIFAR-10
+ Epochs: 200
+ Batch Size: 128
+ FLOPs: 2520000000
+ Parameters: 42510000
+ In Collection: ResNet
+ Results:
+ - Dataset: CIFAR-10
+ Metrics:
+ Top 1 Accuracy: 95.58
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/resnet/resnet101_b16x8_cifar10_20210528-2d29e936.pth
+ Config: configs/resnet/resnet101_8xb16_cifar10.py
+ - Name: resnet152_8xb16_cifar10
+ Metadata:
+ Training Data: CIFAR-10
+ Epochs: 200
+ Batch Size: 128
+ FLOPs: 3740000000
+ Parameters: 58160000
+ In Collection: ResNet
+ Results:
+ - Dataset: CIFAR-10
+ Metrics:
+ Top 1 Accuracy: 95.76
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/resnet/resnet152_b16x8_cifar10_20210528-3e8e9178.pth
+ Config: configs/resnet/resnet152_8xb16_cifar10.py
+ - Name: resnet50_8xb16_cifar100
+ Metadata:
+ Training Data: CIFAR-100
+ Epochs: 200
+ Batch Size: 128
+ FLOPs: 1310000000
+ Parameters: 23710000
+ In Collection: ResNet
+ Results:
+ - Dataset: CIFAR-100
+ Metrics:
+ Top 1 Accuracy: 79.90
+ Top 5 Accuracy: 95.19
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/resnet/resnet50_b16x8_cifar100_20210528-67b58a1b.pth
+ Config: configs/resnet/resnet50_8xb16_cifar100.py
+ - Name: resnet18_8xb32_in1k
+ Metadata:
+ FLOPs: 1820000000
+ Parameters: 11690000
+ In Collection: ResNet
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 69.90
+ Top 5 Accuracy: 89.43
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/resnet/resnet18_8xb32_in1k_20210831-fbbb1da6.pth
+ Config: configs/resnet/resnet18_8xb32_in1k.py
+ - Name: resnet34_8xb32_in1k
+ Metadata:
+ FLOPs: 3680000000
+ Parameters: 21800000
+ In Collection: ResNet
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 73.62
+ Top 5 Accuracy: 91.59
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/resnet/resnet34_8xb32_in1k_20210831-f257d4e6.pth
+ Config: configs/resnet/resnet34_8xb32_in1k.py
+ - Name: resnet50_8xb32_in1k
+ Metadata:
+ FLOPs: 4120000000
+ Parameters: 25560000
+ In Collection: ResNet
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 76.55
+ Top 5 Accuracy: 93.06
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/resnet/resnet50_8xb32_in1k_20210831-ea4938fc.pth
+ Config: configs/resnet/resnet50_8xb32_in1k.py
+ - Name: resnet101_8xb32_in1k
+ Metadata:
+ FLOPs: 7850000000
+ Parameters: 44550000
+ In Collection: ResNet
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 77.97
+ Top 5 Accuracy: 94.06
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/resnet/resnet101_8xb32_in1k_20210831-539c63f8.pth
+ Config: configs/resnet/resnet101_8xb32_in1k.py
+ - Name: resnet152_8xb32_in1k
+ Metadata:
+ FLOPs: 11580000000
+ Parameters: 60190000
+ In Collection: ResNet
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 78.48
+ Top 5 Accuracy: 94.13
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/resnet/resnet152_8xb32_in1k_20210901-4d7582fa.pth
+ Config: configs/resnet/resnet152_8xb32_in1k.py
+ - Name: resnetv1d50_8xb32_in1k
+ Metadata:
+ FLOPs: 4360000000
+ Parameters: 25580000
+ In Collection: ResNet
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 77.54
+ Top 5 Accuracy: 93.57
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/resnet/resnetv1d50_b32x8_imagenet_20210531-db14775a.pth
+ Config: configs/resnet/resnetv1d50_8xb32_in1k.py
+ - Name: resnetv1d101_8xb32_in1k
+ Metadata:
+ FLOPs: 8090000000
+ Parameters: 44570000
+ In Collection: ResNet
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 78.93
+ Top 5 Accuracy: 94.48
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/resnet/resnetv1d101_b32x8_imagenet_20210531-6e13bcd3.pth
+ Config: configs/resnet/resnetv1d101_8xb32_in1k.py
+ - Name: resnetv1d152_8xb32_in1k
+ Metadata:
+ FLOPs: 11820000000
+ Parameters: 60210000
+ In Collection: ResNet
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 79.41
+ Top 5 Accuracy: 94.70
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/resnet/resnetv1d152_b32x8_imagenet_20210531-278cf22a.pth
+ Config: configs/resnet/resnetv1d152_8xb32_in1k.py
+ - Name: resnet50_8xb32-fp16_in1k
+ Metadata:
+ FLOPs: 4120000000
+ Parameters: 25560000
+ Training Techniques:
+ - SGD with Momentum
+ - Weight Decay
+ - Mixed Precision Training
+ In Collection: ResNet
+ Results:
+ - Task: Image Classification
+ Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 76.30
+ Top 5 Accuracy: 93.07
+ Weights: https://download.openmmlab.com/mmclassification/v0/fp16/resnet50_batch256_fp16_imagenet_20210320-b3964210.pth
+ Config: configs/resnet/resnet50_8xb32-fp16_in1k.py
+ - Name: resnet50_8xb256-rsb-a1-600e_in1k
+ Metadata:
+ FLOPs: 4120000000
+ Parameters: 25560000
+ Training Techniques:
+ - LAMB
+ - Weight Decay
+ - Cosine Annealing
+ - Mixup
+ - CutMix
+ - RepeatAugSampler
+ - RandAugment
+ Epochs: 600
+ Batch Size: 2048
+ In Collection: ResNet
+ Results:
+ - Task: Image Classification
+ Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 80.12
+ Top 5 Accuracy: 94.78
+ Weights: https://download.openmmlab.com/mmclassification/v0/resnet/resnet50_8xb256-rsb-a1-600e_in1k_20211228-20e21305.pth
+ Config: configs/resnet/resnet50_8xb256-rsb-a1-600e_in1k.py
+ - Name: resnet50_8xb256-rsb-a2-300e_in1k
+ Metadata:
+ FLOPs: 4120000000
+ Parameters: 25560000
+ Training Techniques:
+ - LAMB
+ - Weight Decay
+ - Cosine Annealing
+ - Mixup
+ - CutMix
+ - RepeatAugSampler
+ - RandAugment
+ Epochs: 300
+ Batch Size: 2048
+ In Collection: ResNet
+ Results:
+ - Task: Image Classification
+ Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 79.55
+ Top 5 Accuracy: 94.37
+ Weights: https://download.openmmlab.com/mmclassification/v0/resnet/resnet50_8xb256-rsb-a2-300e_in1k_20211228-0fd8be6e.pth
+ Config: configs/resnet/resnet50_8xb256-rsb-a2-300e_in1k.py
+ - Name: resnet50_8xb256-rsb-a3-100e_in1k
+ Metadata:
+ FLOPs: 4120000000
+ Parameters: 25560000
+ Training Techniques:
+ - LAMB
+ - Weight Decay
+ - Cosine Annealing
+ - Mixup
+ - CutMix
+ - RandAugment
+ Batch Size: 2048
+ In Collection: ResNet
+ Results:
+ - Task: Image Classification
+ Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 78.30
+ Top 5 Accuracy: 93.80
+ Weights: https://download.openmmlab.com/mmclassification/v0/resnet/resnet50_8xb256-rsb-a3-100e_in1k_20211228-3493673c.pth
+ Config: configs/resnet/resnet50_8xb256-rsb-a3-100e_in1k.py
+ - Name: resnetv1c50_8xb32_in1k
+ Metadata:
+ FLOPs: 4360000000
+ Parameters: 25580000
+ In Collection: ResNet
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 77.01
+ Top 5 Accuracy: 93.58
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/resnet/resnetv1c50_8xb32_in1k_20220214-3343eccd.pth
+ Config: configs/resnet/resnetv1c50_8xb32_in1k.py
+ - Name: resnetv1c101_8xb32_in1k
+ Metadata:
+ FLOPs: 8090000000
+ Parameters: 44570000
+ In Collection: ResNet
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 78.30
+ Top 5 Accuracy: 94.27
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/resnet/resnetv1c101_8xb32_in1k_20220214-434fe45f.pth
+ Config: configs/resnet/resnetv1c101_8xb32_in1k.py
+ - Name: resnetv1c152_8xb32_in1k
+ Metadata:
+ FLOPs: 11820000000
+ Parameters: 60210000
+ In Collection: ResNet
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 78.76
+ Top 5 Accuracy: 94.41
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/resnet/resnetv1c152_8xb32_in1k_20220214-c013291f.pth
+ Config: configs/resnet/resnetv1c152_8xb32_in1k.py
+ - Name: resnet50_8xb8_cub
+ Metadata:
+ FLOPs: 16480000000
+ Parameters: 23920000
+ In Collection: ResNet
+ Results:
+ - Dataset: CUB-200-2011
+ Metrics:
+ Top 1 Accuracy: 88.45
+ Task: Image Classification
+ Pretrain: https://download.openmmlab.com/mmclassification/v0/resnet/resnet50_3rdparty-mill_in21k_20220331-faac000b.pth
+ Weights: https://download.openmmlab.com/mmclassification/v0/resnet/resnet50_8xb8_cub_20220307-57840e60.pth
+ Config: configs/resnet/resnet50_8xb8_cub.py
diff --git a/configs/resnet/resnet101_8xb16_cifar10.py b/configs/resnet/resnet101_8xb16_cifar10.py
new file mode 100644
index 0000000000000000000000000000000000000000..166a1740b09c5fb74462a0672cd5fef54caae8f7
--- /dev/null
+++ b/configs/resnet/resnet101_8xb16_cifar10.py
@@ -0,0 +1,5 @@
+_base_ = [
+ '../_base_/models/resnet101_cifar.py',
+ '../_base_/datasets/cifar10_bs16.py',
+ '../_base_/schedules/cifar10_bs128.py', '../_base_/default_runtime.py'
+]
diff --git a/configs/resnet/resnet101_8xb32_in1k.py b/configs/resnet/resnet101_8xb32_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..388d2cd918ab75ec46346faa0448ef9cf2893fc8
--- /dev/null
+++ b/configs/resnet/resnet101_8xb32_in1k.py
@@ -0,0 +1,4 @@
+_base_ = [
+ '../_base_/models/resnet101.py', '../_base_/datasets/imagenet_bs32.py',
+ '../_base_/schedules/imagenet_bs256.py', '../_base_/default_runtime.py'
+]
diff --git a/configs/resnet/resnet152_8xb16_cifar10.py b/configs/resnet/resnet152_8xb16_cifar10.py
new file mode 100644
index 0000000000000000000000000000000000000000..3f307b6aa81661558b8308094de6e8327d08c830
--- /dev/null
+++ b/configs/resnet/resnet152_8xb16_cifar10.py
@@ -0,0 +1,5 @@
+_base_ = [
+ '../_base_/models/resnet152_cifar.py',
+ '../_base_/datasets/cifar10_bs16.py',
+ '../_base_/schedules/cifar10_bs128.py', '../_base_/default_runtime.py'
+]
diff --git a/configs/resnet/resnet152_8xb32_in1k.py b/configs/resnet/resnet152_8xb32_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..cc9dc2cee4a0fd8a9d47d461b2d5d00bf9962bf5
--- /dev/null
+++ b/configs/resnet/resnet152_8xb32_in1k.py
@@ -0,0 +1,4 @@
+_base_ = [
+ '../_base_/models/resnet152.py', '../_base_/datasets/imagenet_bs32.py',
+ '../_base_/schedules/imagenet_bs256.py', '../_base_/default_runtime.py'
+]
diff --git a/configs/resnet/resnet18_8xb16_cifar10.py b/configs/resnet/resnet18_8xb16_cifar10.py
new file mode 100644
index 0000000000000000000000000000000000000000..c7afa397b7b6a01decd0a010816ebe3678ca44aa
--- /dev/null
+++ b/configs/resnet/resnet18_8xb16_cifar10.py
@@ -0,0 +1,4 @@
+_base_ = [
+ '../_base_/models/resnet18_cifar.py', '../_base_/datasets/cifar10_bs16.py',
+ '../_base_/schedules/cifar10_bs128.py', '../_base_/default_runtime.py'
+]
diff --git a/configs/resnet/resnet18_8xb32_in1k.py b/configs/resnet/resnet18_8xb32_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..ac452ff75602464eba84a3eea150b30748122c69
--- /dev/null
+++ b/configs/resnet/resnet18_8xb32_in1k.py
@@ -0,0 +1,4 @@
+_base_ = [
+ '../_base_/models/resnet18.py', '../_base_/datasets/imagenet_bs32.py',
+ '../_base_/schedules/imagenet_bs256.py', '../_base_/default_runtime.py'
+]
diff --git a/configs/resnet/resnet34_8xb16_cifar10.py b/configs/resnet/resnet34_8xb16_cifar10.py
new file mode 100644
index 0000000000000000000000000000000000000000..7f5cd517d505ea479b506b6e4756c117c392dabd
--- /dev/null
+++ b/configs/resnet/resnet34_8xb16_cifar10.py
@@ -0,0 +1,4 @@
+_base_ = [
+ '../_base_/models/resnet34_cifar.py', '../_base_/datasets/cifar10_bs16.py',
+ '../_base_/schedules/cifar10_bs128.py', '../_base_/default_runtime.py'
+]
diff --git a/configs/resnet/resnet34_8xb32_in1k.py b/configs/resnet/resnet34_8xb32_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..7749261c80defef7cbf94c4e1284c26382246dc6
--- /dev/null
+++ b/configs/resnet/resnet34_8xb32_in1k.py
@@ -0,0 +1,4 @@
+_base_ = [
+ '../_base_/models/resnet34.py', '../_base_/datasets/imagenet_bs32.py',
+ '../_base_/schedules/imagenet_bs256.py', '../_base_/default_runtime.py'
+]
diff --git a/configs/resnet/resnet50_32xb64-warmup-coslr_in1k.py b/configs/resnet/resnet50_32xb64-warmup-coslr_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..c26245ef53a736c22c0ef7d4e9d8b7876509fe2e
--- /dev/null
+++ b/configs/resnet/resnet50_32xb64-warmup-coslr_in1k.py
@@ -0,0 +1,5 @@
+_base_ = [
+ '../_base_/models/resnet50.py', '../_base_/datasets/imagenet_bs64.py',
+ '../_base_/schedules/imagenet_bs2048_coslr.py',
+ '../_base_/default_runtime.py'
+]
diff --git a/configs/resnet/resnet50_32xb64-warmup-lbs_in1k.py b/configs/resnet/resnet50_32xb64-warmup-lbs_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..2f24f9a0f2c54a2bb634c1f374bc1b534d63697f
--- /dev/null
+++ b/configs/resnet/resnet50_32xb64-warmup-lbs_in1k.py
@@ -0,0 +1,12 @@
+_base_ = ['./resnet50_32xb64-warmup_in1k.py']
+model = dict(
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=2048,
+ loss=dict(
+ type='LabelSmoothLoss',
+ loss_weight=1.0,
+ label_smooth_val=0.1,
+ num_classes=1000),
+ ))
diff --git a/configs/resnet/resnet50_32xb64-warmup_in1k.py b/configs/resnet/resnet50_32xb64-warmup_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..34d5288b9d3f9fcf3f0b409dc1c17906654c2170
--- /dev/null
+++ b/configs/resnet/resnet50_32xb64-warmup_in1k.py
@@ -0,0 +1,4 @@
+_base_ = [
+ '../_base_/models/resnet50.py', '../_base_/datasets/imagenet_bs64.py',
+ '../_base_/schedules/imagenet_bs2048.py', '../_base_/default_runtime.py'
+]
diff --git a/configs/resnet/resnet50_8xb128_coslr-90e_in21k.py b/configs/resnet/resnet50_8xb128_coslr-90e_in21k.py
new file mode 100644
index 0000000000000000000000000000000000000000..d2cc1ee2830661998505310d8c7074d8ae5da6b4
--- /dev/null
+++ b/configs/resnet/resnet50_8xb128_coslr-90e_in21k.py
@@ -0,0 +1,11 @@
+_base_ = [
+ '../_base_/models/resnet50.py', '../_base_/datasets/imagenet21k_bs128.py',
+ '../_base_/schedules/imagenet_bs1024_coslr.py',
+ '../_base_/default_runtime.py'
+]
+
+# model settings
+model = dict(head=dict(num_classes=21843))
+
+# runtime settings
+train_cfg = dict(by_epoch=True, max_epochs=90)
diff --git a/configs/resnet/resnet50_8xb16-mixup_cifar10.py b/configs/resnet/resnet50_8xb16-mixup_cifar10.py
new file mode 100644
index 0000000000000000000000000000000000000000..2420ebfeb0a34675a4b1b2a69c0b8a39e197ce35
--- /dev/null
+++ b/configs/resnet/resnet50_8xb16-mixup_cifar10.py
@@ -0,0 +1,5 @@
+_base_ = [
+ '../_base_/models/resnet50_cifar_mixup.py',
+ '../_base_/datasets/cifar10_bs16.py',
+ '../_base_/schedules/cifar10_bs128.py', '../_base_/default_runtime.py'
+]
diff --git a/configs/resnet/resnet50_8xb16_cifar10.py b/configs/resnet/resnet50_8xb16_cifar10.py
new file mode 100644
index 0000000000000000000000000000000000000000..669e5de27e526dd46d9f06c99e478dce16f0ac9a
--- /dev/null
+++ b/configs/resnet/resnet50_8xb16_cifar10.py
@@ -0,0 +1,4 @@
+_base_ = [
+ '../_base_/models/resnet50_cifar.py', '../_base_/datasets/cifar10_bs16.py',
+ '../_base_/schedules/cifar10_bs128.py', '../_base_/default_runtime.py'
+]
diff --git a/configs/resnet/resnet50_8xb16_cifar100.py b/configs/resnet/resnet50_8xb16_cifar100.py
new file mode 100644
index 0000000000000000000000000000000000000000..ebde6c76ecca6d23b58edfb85ebc3b72ce15a2b2
--- /dev/null
+++ b/configs/resnet/resnet50_8xb16_cifar100.py
@@ -0,0 +1,19 @@
+_base_ = [
+ '../_base_/models/resnet50_cifar.py',
+ '../_base_/datasets/cifar100_bs16.py',
+ '../_base_/schedules/cifar10_bs128.py',
+ '../_base_/default_runtime.py',
+]
+
+# model settings
+model = dict(head=dict(num_classes=100))
+
+# schedule settings
+optim_wrapper = dict(optimizer=dict(weight_decay=0.0005))
+
+param_scheduler = dict(
+ type='MultiStepLR',
+ by_epoch=True,
+ milestones=[60, 120, 160],
+ gamma=0.2,
+)
diff --git a/configs/resnet/resnet50_8xb256-rsb-a1-600e_in1k.py b/configs/resnet/resnet50_8xb256-rsb-a1-600e_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..a4ea15984a0063c06e09eb5063d49b2cf90371cf
--- /dev/null
+++ b/configs/resnet/resnet50_8xb256-rsb-a1-600e_in1k.py
@@ -0,0 +1,56 @@
+_base_ = [
+ '../_base_/models/resnet50.py',
+ '../_base_/datasets/imagenet_bs256_rsb_a12.py',
+ '../_base_/schedules/imagenet_bs2048_rsb.py',
+ '../_base_/default_runtime.py'
+]
+
+# model settings
+model = dict(
+ backbone=dict(
+ norm_cfg=dict(type='SyncBN', requires_grad=True),
+ drop_path_rate=0.05,
+ ),
+ head=dict(
+ loss=dict(
+ type='LabelSmoothLoss',
+ label_smooth_val=0.1,
+ mode='original',
+ use_sigmoid=True,
+ )),
+ train_cfg=dict(augments=[
+ dict(type='Mixup', alpha=0.2),
+ dict(type='CutMix', alpha=1.0)
+ ]),
+)
+
+# dataset settings
+train_dataloader = dict(sampler=dict(type='RepeatAugSampler', shuffle=True))
+
+# schedule settings
+optim_wrapper = dict(
+ optimizer=dict(weight_decay=0.01),
+ paramwise_cfg=dict(bias_decay_mult=0., norm_decay_mult=0.),
+)
+
+param_scheduler = [
+ # warm up learning rate scheduler
+ dict(
+ type='LinearLR',
+ start_factor=0.0001,
+ by_epoch=True,
+ begin=0,
+ end=5,
+ # update by iter
+ convert_to_iter_based=True),
+ # main learning rate scheduler
+ dict(
+ type='CosineAnnealingLR',
+ T_max=595,
+ eta_min=1.0e-6,
+ by_epoch=True,
+ begin=5,
+ end=600)
+]
+
+train_cfg = dict(by_epoch=True, max_epochs=600)
diff --git a/configs/resnet/resnet50_8xb256-rsb-a2-300e_in1k.py b/configs/resnet/resnet50_8xb256-rsb-a2-300e_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..df8edc0370400a3f3985c33bffae2d04afc55772
--- /dev/null
+++ b/configs/resnet/resnet50_8xb256-rsb-a2-300e_in1k.py
@@ -0,0 +1,46 @@
+_base_ = [
+ '../_base_/models/resnet50.py',
+ '../_base_/datasets/imagenet_bs256_rsb_a12.py',
+ '../_base_/schedules/imagenet_bs2048_rsb.py',
+ '../_base_/default_runtime.py'
+]
+
+# model settings
+model = dict(
+ backbone=dict(
+ norm_cfg=dict(type='SyncBN', requires_grad=True),
+ drop_path_rate=0.05,
+ ),
+ head=dict(loss=dict(use_sigmoid=True)),
+ train_cfg=dict(augments=[
+ dict(type='Mixup', alpha=0.1),
+ dict(type='CutMix', alpha=1.0)
+ ]))
+
+# dataset settings
+train_dataloader = dict(sampler=dict(type='RepeatAugSampler', shuffle=True))
+
+# schedule settings
+optim_wrapper = dict(
+ paramwise_cfg=dict(bias_decay_mult=0., norm_decay_mult=0.))
+
+param_scheduler = [
+ # warm up learning rate scheduler
+ dict(
+ type='LinearLR',
+ start_factor=0.0001,
+ by_epoch=True,
+ begin=0,
+ end=5,
+ # update by iter
+ convert_to_iter_based=True),
+ # main learning rate scheduler
+ dict(
+ type='CosineAnnealingLR',
+ T_max=295,
+ eta_min=1.0e-6,
+ by_epoch=True,
+ begin=5,
+ end=300)
+]
+train_cfg = dict(by_epoch=True, max_epochs=300)
diff --git a/configs/resnet/resnet50_8xb256-rsb-a3-100e_in1k.py b/configs/resnet/resnet50_8xb256-rsb-a3-100e_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..3a36c5843a69aea20fdb9287561e5c2a96459852
--- /dev/null
+++ b/configs/resnet/resnet50_8xb256-rsb-a3-100e_in1k.py
@@ -0,0 +1,22 @@
+_base_ = [
+ '../_base_/models/resnet50.py',
+ '../_base_/datasets/imagenet_bs256_rsb_a3.py',
+ '../_base_/schedules/imagenet_bs2048_rsb.py',
+ '../_base_/default_runtime.py'
+]
+
+# model settings
+model = dict(
+ backbone=dict(norm_cfg=dict(type='SyncBN', requires_grad=True)),
+ head=dict(loss=dict(use_sigmoid=True)),
+ train_cfg=dict(augments=[
+ dict(type='Mixup', alpha=0.1),
+ dict(type='CutMix', alpha=1.0)
+ ]),
+)
+
+# schedule settings
+optim_wrapper = dict(
+ optimizer=dict(lr=0.008),
+ paramwise_cfg=dict(bias_decay_mult=0., norm_decay_mult=0.),
+)
diff --git a/configs/resnet/resnet50_8xb32-coslr-preciseBN_in1k.py b/configs/resnet/resnet50_8xb32-coslr-preciseBN_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..01fefbbf2852eeceddb0ad026fb5098e763e0710
--- /dev/null
+++ b/configs/resnet/resnet50_8xb32-coslr-preciseBN_in1k.py
@@ -0,0 +1,13 @@
+_base_ = 'resnet50_8xb32-coslr_in1k.py'
+
+# The PreciseBN hook updates the BN statistics, so it should be executed
+# before CheckpointHook (priority 'VERY_LOW') and EMAHook (priority
+# 'NORMAL'). Therefore, the priority of PreciseBNHook is set to
+# 'ABOVE_NORMAL' here.
+custom_hooks = [
+ dict(
+ type='PreciseBNHook',
+ num_samples=8192,
+ interval=1,
+ priority='ABOVE_NORMAL')
+]
diff --git a/configs/resnet/resnet50_8xb32-coslr_in1k.py b/configs/resnet/resnet50_8xb32-coslr_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..938a114b79696b5ad3442c1dd2a7aea33342b679
--- /dev/null
+++ b/configs/resnet/resnet50_8xb32-coslr_in1k.py
@@ -0,0 +1,5 @@
+_base_ = [
+ '../_base_/models/resnet50.py', '../_base_/datasets/imagenet_bs32.py',
+ '../_base_/schedules/imagenet_bs256_coslr.py',
+ '../_base_/default_runtime.py'
+]
diff --git a/configs/resnet/resnet50_8xb32-cutmix_in1k.py b/configs/resnet/resnet50_8xb32-cutmix_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..2f8d0ca9f3a500344c18b669f25f3cb78393d7dd
--- /dev/null
+++ b/configs/resnet/resnet50_8xb32-cutmix_in1k.py
@@ -0,0 +1,5 @@
+_base_ = [
+ '../_base_/models/resnet50_cutmix.py',
+ '../_base_/datasets/imagenet_bs32.py',
+ '../_base_/schedules/imagenet_bs256.py', '../_base_/default_runtime.py'
+]
diff --git a/configs/resnet/resnet50_8xb32-fp16-dynamic_in1k.py b/configs/resnet/resnet50_8xb32-fp16-dynamic_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..58f6fe4cf25e8f0b3d321a7aab4b746552aa4163
--- /dev/null
+++ b/configs/resnet/resnet50_8xb32-fp16-dynamic_in1k.py
@@ -0,0 +1,4 @@
+_base_ = ['./resnet50_8xb32_in1k.py']
+
+# schedule settings
+optim_wrapper = dict(type='AmpOptimWrapper', loss_scale='dynamic')
diff --git a/configs/resnet/resnet50_8xb32-fp16_in1k.py b/configs/resnet/resnet50_8xb32-fp16_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..19ee6ee4f82ec02f34628bdf8dd74a379798cc67
--- /dev/null
+++ b/configs/resnet/resnet50_8xb32-fp16_in1k.py
@@ -0,0 +1,4 @@
+_base_ = ['./resnet50_8xb32_in1k.py']
+
+# schedule settings
+optim_wrapper = dict(type='AmpOptimWrapper', loss_scale=512.)
diff --git a/configs/resnet/resnet50_8xb32-lbs_in1k.py b/configs/resnet/resnet50_8xb32-lbs_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..1c1aa5a2c4eee10c10159175224d9b77ea57e57b
--- /dev/null
+++ b/configs/resnet/resnet50_8xb32-lbs_in1k.py
@@ -0,0 +1,5 @@
+_base_ = [
+ '../_base_/models/resnet50_label_smooth.py',
+ '../_base_/datasets/imagenet_bs32.py',
+ '../_base_/schedules/imagenet_bs256.py', '../_base_/default_runtime.py'
+]
diff --git a/configs/resnet/resnet50_8xb32-mixup_in1k.py b/configs/resnet/resnet50_8xb32-mixup_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..2a153d0e18f521f72b8beaf4cbea36d41f5b3300
--- /dev/null
+++ b/configs/resnet/resnet50_8xb32-mixup_in1k.py
@@ -0,0 +1,5 @@
+_base_ = [
+ '../_base_/models/resnet50_mixup.py',
+ '../_base_/datasets/imagenet_bs32.py',
+ '../_base_/schedules/imagenet_bs256.py', '../_base_/default_runtime.py'
+]
diff --git a/configs/resnet/resnet50_8xb32_in1k.py b/configs/resnet/resnet50_8xb32_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..c32f333b67c255c6101469323636bf242eebb8da
--- /dev/null
+++ b/configs/resnet/resnet50_8xb32_in1k.py
@@ -0,0 +1,4 @@
+_base_ = [
+ '../_base_/models/resnet50.py', '../_base_/datasets/imagenet_bs32.py',
+ '../_base_/schedules/imagenet_bs256.py', '../_base_/default_runtime.py'
+]
diff --git a/configs/resnet/resnet50_8xb8_cub.py b/configs/resnet/resnet50_8xb8_cub.py
new file mode 100644
index 0000000000000000000000000000000000000000..17054ef536930d74136897f8f25637321a364ce7
--- /dev/null
+++ b/configs/resnet/resnet50_8xb8_cub.py
@@ -0,0 +1,20 @@
+_base_ = [
+ '../_base_/models/resnet50.py',
+ '../_base_/datasets/cub_bs8_448.py',
+ '../_base_/schedules/cub_bs64.py',
+ '../_base_/default_runtime.py',
+]
+
+# model settings
+# use the pre-trained weights converted from https://github.com/Alibaba-MIIL/ImageNet21K # noqa
+pretrained = 'https://download.openmmlab.com/mmclassification/v0/resnet/resnet50_3rdparty-mill_in21k_20220331-faac000b.pth' # noqa
+
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ init_cfg=dict(
+ type='Pretrained', checkpoint=pretrained, prefix='backbone')),
+ head=dict(num_classes=200, ))
+
+# runtime settings
+default_hooks = dict(logger=dict(type='LoggerHook', interval=20))
diff --git a/configs/resnet/resnetv1c101_8xb32_in1k.py b/configs/resnet/resnetv1c101_8xb32_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..441aff591851f402a176c142c93dc866a77b82c2
--- /dev/null
+++ b/configs/resnet/resnetv1c101_8xb32_in1k.py
@@ -0,0 +1,7 @@
+_base_ = [
+ '../_base_/models/resnetv1c50.py',
+ '../_base_/datasets/imagenet_bs32_pil_resize.py',
+ '../_base_/schedules/imagenet_bs256.py', '../_base_/default_runtime.py'
+]
+
+model = dict(backbone=dict(depth=101))
diff --git a/configs/resnet/resnetv1c152_8xb32_in1k.py b/configs/resnet/resnetv1c152_8xb32_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..b9f466f85c8e8c89fb78f53c27eca1d5acaf5221
--- /dev/null
+++ b/configs/resnet/resnetv1c152_8xb32_in1k.py
@@ -0,0 +1,7 @@
+_base_ = [
+ '../_base_/models/resnetv1c50.py',
+ '../_base_/datasets/imagenet_bs32_pil_resize.py',
+ '../_base_/schedules/imagenet_bs256.py', '../_base_/default_runtime.py'
+]
+
+model = dict(backbone=dict(depth=152))
diff --git a/configs/resnet/resnetv1c50_8xb32_in1k.py b/configs/resnet/resnetv1c50_8xb32_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..aa1c8b6475ce373f4a35123a72e31419b87027c0
--- /dev/null
+++ b/configs/resnet/resnetv1c50_8xb32_in1k.py
@@ -0,0 +1,5 @@
+_base_ = [
+ '../_base_/models/resnetv1c50.py',
+ '../_base_/datasets/imagenet_bs32_pil_resize.py',
+ '../_base_/schedules/imagenet_bs256.py', '../_base_/default_runtime.py'
+]
diff --git a/configs/resnet/resnetv1d101_8xb32_in1k.py b/configs/resnet/resnetv1d101_8xb32_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..b16ca863db2c50267764b1b37aa8b2db891ad2c9
--- /dev/null
+++ b/configs/resnet/resnetv1d101_8xb32_in1k.py
@@ -0,0 +1,5 @@
+_base_ = [
+ '../_base_/models/resnetv1d101.py',
+ '../_base_/datasets/imagenet_bs32_pil_resize.py',
+ '../_base_/schedules/imagenet_bs256.py', '../_base_/default_runtime.py'
+]
diff --git a/configs/resnet/resnetv1d152_8xb32_in1k.py b/configs/resnet/resnetv1d152_8xb32_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..76926ddbb661029b8cff86ad0d98028531235fa1
--- /dev/null
+++ b/configs/resnet/resnetv1d152_8xb32_in1k.py
@@ -0,0 +1,5 @@
+_base_ = [
+ '../_base_/models/resnetv1d152.py',
+ '../_base_/datasets/imagenet_bs32_pil_resize.py',
+ '../_base_/schedules/imagenet_bs256.py', '../_base_/default_runtime.py'
+]
diff --git a/configs/resnet/resnetv1d50_8xb32_in1k.py b/configs/resnet/resnetv1d50_8xb32_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..208bde470ad12407d7e56eddeddfc88529e3708b
--- /dev/null
+++ b/configs/resnet/resnetv1d50_8xb32_in1k.py
@@ -0,0 +1,5 @@
+_base_ = [
+ '../_base_/models/resnetv1d50.py',
+ '../_base_/datasets/imagenet_bs32_pil_resize.py',
+ '../_base_/schedules/imagenet_bs256.py', '../_base_/default_runtime.py'
+]
diff --git a/configs/resnext/README.md b/configs/resnext/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..b901b31bd5bd3b99bce07cc2454e4b9a12d40bb2
--- /dev/null
+++ b/configs/resnext/README.md
@@ -0,0 +1,83 @@
+# ResNeXt
+
+> [Aggregated Residual Transformations for Deep Neural Networks](https://openaccess.thecvf.com/content_cvpr_2017/html/Xie_Aggregated_Residual_Transformations_CVPR_2017_paper.html)
+
+
+
+## Abstract
+
+We present a simple, highly modularized network architecture for image classification. Our network is constructed by repeating a building block that aggregates a set of transformations with the same topology. Our simple design results in a homogeneous, multi-branch architecture that has only a few hyper-parameters to set. This strategy exposes a new dimension, which we call "cardinality" (the size of the set of transformations), as an essential factor in addition to the dimensions of depth and width. On the ImageNet-1K dataset, we empirically show that even under the restricted condition of maintaining complexity, increasing cardinality is able to improve classification accuracy. Moreover, increasing cardinality is more effective than going deeper or wider when we increase the capacity. Our models, named ResNeXt, are the foundations of our entry to the ILSVRC 2016 classification task in which we secured 2nd place. We further investigate ResNeXt on an ImageNet-5K set and the COCO detection set, also showing better results than its ResNet counterpart. The code and models are publicly available online.
+
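+As a rough illustration of the aggregated-transformations idea (and not the actual `ResNeXt` backbone in this
+repository), the set of branches with identical topology can be realized as a single grouped convolution,
+where `groups` plays the role of cardinality; the channel sizes below are placeholders:
+
+```python
+import torch
+import torch.nn as nn
+
+# Cardinality = 32: the 3x3 convolution is split into 32 parallel branches
+# with identical topology, implemented here as one grouped convolution.
+cardinality, channels, bottleneck = 32, 256, 128
+block = nn.Sequential(
+    nn.Conv2d(channels, bottleneck, 1, bias=False),
+    nn.BatchNorm2d(bottleneck),
+    nn.ReLU(inplace=True),
+    nn.Conv2d(bottleneck, bottleneck, 3, padding=1, groups=cardinality, bias=False),
+    nn.BatchNorm2d(bottleneck),
+    nn.ReLU(inplace=True),
+    nn.Conv2d(bottleneck, channels, 1, bias=False),
+    nn.BatchNorm2d(channels),
+)
+
+x = torch.rand(1, channels, 56, 56)
+out = torch.relu(x + block(x))  # residual shortcut around the aggregated block
+```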
+
+

+
+
+## How to use it?
+
+
+
+**Predict image**
+
+```python
+from mmpretrain import inference_model
+
+predict = inference_model('resnext50-32x4d_8xb32_in1k', 'demo/bird.JPEG')
+print(predict['pred_class'])
+print(predict['pred_score'])
+```
+
+**Use the model**
+
+```python
+import torch
+from mmpretrain import get_model
+
+model = get_model('resnext50-32x4d_8xb32_in1k', pretrained=True)
+inputs = torch.rand(1, 3, 224, 224)
+out = model(inputs)
+print(type(out))
+# To extract features.
+feats = model.extract_feat(inputs)
+print(type(feats))
+```
+
+**Train/Test Command**
+
+Prepare your dataset according to the [docs](https://mmpretrain.readthedocs.io/en/latest/user_guides/dataset_prepare.html#prepare-dataset).
+
+Train:
+
+```shell
+python tools/train.py configs/resnext/resnext50-32x4d_8xb32_in1k.py
+```
+
+Test:
+
+```shell
+python tools/test.py configs/resnext/resnext50-32x4d_8xb32_in1k.py https://download.openmmlab.com/mmclassification/v0/resnext/resnext50_32x4d_b32x8_imagenet_20210429-56066e27.pth
+```
+
+
+
+## Models and results
+
+### Image Classification on ImageNet-1k
+
+| Model | Pretrain | Params (M) | Flops (G) | Top-1 (%) | Top-5 (%) | Config | Download |
+| :---------------------------- | :----------: | :--------: | :-------: | :-------: | :-------: | :--------------------------------------: | :--------------------------------------------------------------------------------: |
+| `resnext50-32x4d_8xb32_in1k` | From scratch | 25.03 | 4.27 | 77.90 | 93.66 | [config](resnext50-32x4d_8xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/resnext/resnext50_32x4d_b32x8_imagenet_20210429-56066e27.pth) \| [log](https://download.openmmlab.com/mmclassification/v0/resnext/resnext50_32x4d_b32x8_imagenet_20210429-56066e27.json) |
+| `resnext101-32x4d_8xb32_in1k` | From scratch | 44.18 | 8.03 | 78.61 | 94.17 | [config](resnext101-32x4d_8xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/resnext/resnext101_32x4d_b32x8_imagenet_20210506-e0fa3dd5.pth) \| [log](https://download.openmmlab.com/mmclassification/v0/resnext/resnext101_32x4d_b32x8_imagenet_20210506-e0fa3dd5.json) |
+| `resnext101-32x8d_8xb32_in1k` | From scratch | 88.79 | 16.50 | 79.27 | 94.58 | [config](resnext101-32x8d_8xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/resnext/resnext101_32x8d_b32x8_imagenet_20210506-23a247d5.pth) \| [log](https://download.openmmlab.com/mmclassification/v0/resnext/resnext101_32x8d_b32x8_imagenet_20210506-23a247d5.json) |
+| `resnext152-32x4d_8xb32_in1k` | From scratch | 59.95 | 11.80 | 78.88 | 94.33 | [config](resnext152-32x4d_8xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/resnext/resnext152_32x4d_b32x8_imagenet_20210524-927787be.pth) \| [log](https://download.openmmlab.com/mmclassification/v0/resnext/resnext152_32x4d_b32x8_imagenet_20210524-927787be.json) |
+
+## Citation
+
+```bibtex
+@inproceedings{xie2017aggregated,
+ title={Aggregated residual transformations for deep neural networks},
+ author={Xie, Saining and Girshick, Ross and Doll{\'a}r, Piotr and Tu, Zhuowen and He, Kaiming},
+ booktitle={Proceedings of the IEEE conference on computer vision and pattern recognition},
+ pages={1492--1500},
+ year={2017}
+}
+```
diff --git a/configs/resnext/metafile.yml b/configs/resnext/metafile.yml
new file mode 100644
index 0000000000000000000000000000000000000000..71283288fd743116c00b14ee1dc1697770b0706c
--- /dev/null
+++ b/configs/resnext/metafile.yml
@@ -0,0 +1,73 @@
+Collections:
+ - Name: ResNeXt
+ Metadata:
+ Training Data: ImageNet-1k
+ Training Techniques:
+ - SGD with Momentum
+ - Weight Decay
+ Training Resources: 8x V100 GPUs
+ Epochs: 100
+ Batch Size: 256
+ Architecture:
+ - ResNeXt
+ Paper:
+ URL: https://openaccess.thecvf.com/content_cvpr_2017/html/Xie_Aggregated_Residual_Transformations_CVPR_2017_paper.html
+ Title: "Aggregated Residual Transformations for Deep Neural Networks"
+ README: configs/resnext/README.md
+ Code:
+ URL: https://github.com/open-mmlab/mmpretrain/blob/v0.15.0/mmcls/models/backbones/resnext.py#L90
+ Version: v0.15.0
+
+Models:
+ - Name: resnext50-32x4d_8xb32_in1k
+ Metadata:
+ FLOPs: 4270000000
+ Parameters: 25030000
+ In Collection: ResNeXt
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 77.90
+ Top 5 Accuracy: 93.66
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/resnext/resnext50_32x4d_b32x8_imagenet_20210429-56066e27.pth
+ Config: configs/resnext/resnext50-32x4d_8xb32_in1k.py
+ - Name: resnext101-32x4d_8xb32_in1k
+ Metadata:
+ FLOPs: 8030000000
+ Parameters: 44180000
+ In Collection: ResNeXt
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 78.61
+ Top 5 Accuracy: 94.17
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/resnext/resnext101_32x4d_b32x8_imagenet_20210506-e0fa3dd5.pth
+ Config: configs/resnext/resnext101-32x4d_8xb32_in1k.py
+ - Name: resnext101-32x8d_8xb32_in1k
+ Metadata:
+ FLOPs: 16500000000
+ Parameters: 88790000
+ In Collection: ResNeXt
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 79.27
+ Top 5 Accuracy: 94.58
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/resnext/resnext101_32x8d_b32x8_imagenet_20210506-23a247d5.pth
+ Config: configs/resnext/resnext101-32x8d_8xb32_in1k.py
+ - Name: resnext152-32x4d_8xb32_in1k
+ Metadata:
+ FLOPs: 11800000000
+ Parameters: 59950000
+ In Collection: ResNeXt
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 78.88
+ Top 5 Accuracy: 94.33
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/resnext/resnext152_32x4d_b32x8_imagenet_20210524-927787be.pth
+ Config: configs/resnext/resnext152-32x4d_8xb32_in1k.py
diff --git a/configs/resnext/resnext101-32x4d_8xb32_in1k.py b/configs/resnext/resnext101-32x4d_8xb32_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..970aa60f35fb6b04f72688d5862155575858b1fe
--- /dev/null
+++ b/configs/resnext/resnext101-32x4d_8xb32_in1k.py
@@ -0,0 +1,5 @@
+_base_ = [
+ '../_base_/models/resnext101_32x4d.py',
+ '../_base_/datasets/imagenet_bs32_pil_resize.py',
+ '../_base_/schedules/imagenet_bs256.py', '../_base_/default_runtime.py'
+]
diff --git a/configs/resnext/resnext101-32x8d_8xb32_in1k.py b/configs/resnext/resnext101-32x8d_8xb32_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..315d05fd57b34d80ab1590077f98d21b80453209
--- /dev/null
+++ b/configs/resnext/resnext101-32x8d_8xb32_in1k.py
@@ -0,0 +1,5 @@
+_base_ = [
+ '../_base_/models/resnext101_32x8d.py',
+ '../_base_/datasets/imagenet_bs32_pil_resize.py',
+ '../_base_/schedules/imagenet_bs256.py', '../_base_/default_runtime.py'
+]
diff --git a/configs/resnext/resnext152-32x4d_8xb32_in1k.py b/configs/resnext/resnext152-32x4d_8xb32_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..9c137313cb7f357f8328048ffe833cdc4952cb84
--- /dev/null
+++ b/configs/resnext/resnext152-32x4d_8xb32_in1k.py
@@ -0,0 +1,5 @@
+_base_ = [
+ '../_base_/models/resnext152_32x4d.py',
+ '../_base_/datasets/imagenet_bs32_pil_resize.py',
+ '../_base_/schedules/imagenet_bs256.py', '../_base_/default_runtime.py'
+]
diff --git a/configs/resnext/resnext50-32x4d_8xb32_in1k.py b/configs/resnext/resnext50-32x4d_8xb32_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..bd9c9fcf4e6d9941cb87ffc963cc99b39069116c
--- /dev/null
+++ b/configs/resnext/resnext50-32x4d_8xb32_in1k.py
@@ -0,0 +1,5 @@
+_base_ = [
+ '../_base_/models/resnext50_32x4d.py',
+ '../_base_/datasets/imagenet_bs32_pil_resize.py',
+ '../_base_/schedules/imagenet_bs256.py', '../_base_/default_runtime.py'
+]
diff --git a/configs/revvit/README.md b/configs/revvit/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..0439b22ac9d196a56016503f210fc73d3baab71d
--- /dev/null
+++ b/configs/revvit/README.md
@@ -0,0 +1,91 @@
+# Reversible Vision Transformers
+
+> [Reversible Vision Transformers](https://openaccess.thecvf.com/content/CVPR2022/papers/Mangalam_Reversible_Vision_Transformers_CVPR_2022_paper.pdf)
+
+
+
+## Introduction
+
+**RevViT** is initially described in [Reversible Vision Transformers](https://openaccess.thecvf.com/content/CVPR2022/papers/Mangalam_Reversible_Vision_Transformers_CVPR_2022_paper.pdf), which introduces the reversible design into vision transformers to reduce the GPU memory footprint required for training.
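+
+The memory saving comes from the reversible residual coupling, in which a block's inputs can be recomputed from its outputs during back-propagation instead of being cached. Below is a minimal sketch of that coupling; plain linear layers stand in for the attention and MLP sub-blocks, and it is not the actual RevViT backbone implementation in this repo.
+
+```python
+import torch
+import torch.nn as nn
+
+
+class ReversibleBlock(nn.Module):
+    """Two-stream reversible block: y1 = x1 + F(x2), y2 = x2 + G(y1)."""
+
+    def __init__(self, dim):
+        super().__init__()
+        # Stand-ins for the attention (F) and MLP (G) sub-blocks.
+        self.f = nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, dim))
+        self.g = nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, dim))
+
+    def forward(self, x1, x2):
+        y1 = x1 + self.f(x2)
+        y2 = x2 + self.g(y1)
+        return y1, y2
+
+    def inverse(self, y1, y2):
+        # Recompute the inputs from the outputs, so intermediate
+        # activations never need to be stored during training.
+        x2 = y2 - self.g(y1)
+        x1 = y1 - self.f(x2)
+        return x1, x2
+
+
+block = ReversibleBlock(64).eval()
+with torch.no_grad():
+    x1, x2 = torch.randn(1, 197, 64), torch.randn(1, 197, 64)
+    y1, y2 = block(x1, x2)
+    r1, r2 = block.inverse(y1, y2)
+    print(torch.allclose(r1, x1, atol=1e-5), torch.allclose(r2, x2, atol=1e-5))
+# expected: True True
+```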
+
+
+
+
+

+
+
+## Abstract
+
+
+
+
+
+We present Reversible Vision Transformers, a memory efficient architecture design for visual recognition. By decoupling the GPU memory footprint from the depth of the model, Reversible Vision Transformers enable memory efficient scaling of transformer architectures. We adapt two popular models, namely Vision Transformer and Multiscale Vision Transformers, to reversible variants and benchmark extensively across both model sizes and tasks of image classification, object detection and video classification. Reversible Vision Transformers achieve a reduced memory footprint of up to 15.5× at identical model complexity, parameters and accuracy, demonstrating the promise of reversible vision transformers as an efficient backbone for resource limited training regimes. Finally, we find that the additional computational burden of recomputing activations is more than overcome for deeper models, where throughput can increase up to 3.9× over their non-reversible counterparts.
+
+
+
+
+## How to use it?
+
+
+
+**Predict image**
+
+```python
+from mmpretrain import inference_model
+
+predict = inference_model('revvit-small_3rdparty_in1k', 'demo/bird.JPEG')
+print(predict['pred_class'])
+print(predict['pred_score'])
+```
+
+**Use the model**
+
+```python
+import torch
+from mmpretrain import get_model
+
+model = get_model('revvit-small_3rdparty_in1k', pretrained=True)
+inputs = torch.rand(1, 3, 224, 224)
+out = model(inputs)
+print(type(out))
+# To extract features.
+feats = model.extract_feat(inputs)
+print(type(feats))
+```
+
+**Test Command**
+
+Prepare your dataset according to the [docs](https://mmpretrain.readthedocs.io/en/latest/user_guides/dataset_prepare.html#prepare-dataset).
+
+Test:
+
+```shell
+python tools/test.py configs/revvit/revvit-small_8xb256_in1k.py https://download.openmmlab.com/mmclassification/v0/revvit/revvit-small_3rdparty_in1k_20221213-a3a34f5c.pth
+```
+
+
+
+## Models and results
+
+### Image Classification on ImageNet-1k
+
+| Model | Pretrain | Params (M) | Flops (G) | Top-1 (%) | Top-5 (%) | Config | Download |
+| :----------------------------- | :----------: | :--------: | :-------: | :-------: | :-------: | :-----------------------------------: | :----------------------------------------------------------------------------------: |
+| `revvit-small_3rdparty_in1k`\* | From scratch | 22.44 | 4.58 | 79.87 | 94.90 | [config](revvit-small_8xb256_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/revvit/revvit-small_3rdparty_in1k_20221213-a3a34f5c.pth) |
+| `revvit-base_3rdparty_in1k`\* | From scratch | 87.34 | 17.49 | 81.81 | 95.56 | [config](revvit-base_8xb256_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/revvit/revvit-base_3rdparty_in1k_20221213-87a7b0a5.pth) |
+
+*Models with * are converted from the [official repo](https://github.com/facebookresearch/SlowFast). The config files of these models are only for inference. We haven't reproduced the training results.*
+
+## Citation
+
+```bibtex
+@inproceedings{mangalam2022reversible,
+ title={Reversible Vision Transformers},
+ author={Mangalam, Karttikeya and Fan, Haoqi and Li, Yanghao and Wu, Chao-Yuan and Xiong, Bo and Feichtenhofer, Christoph and Malik, Jitendra},
+ booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
+ pages={10830--10840},
+ year={2022}
+}
+```
diff --git a/configs/revvit/metafile.yml b/configs/revvit/metafile.yml
new file mode 100644
index 0000000000000000000000000000000000000000..842de071f1b15cc9bc65b1ff85d208b6d7131b9d
--- /dev/null
+++ b/configs/revvit/metafile.yml
@@ -0,0 +1,48 @@
+Collections:
+ - Name: RevViT
+ Metadata:
+ Training Data: ImageNet-1k
+ Architecture:
+ - Vision Transformer
+ - Reversible
+ Paper:
+ URL: https://openaccess.thecvf.com/content/CVPR2022/papers/Mangalam_Reversible_Vision_Transformers_CVPR_2022_paper.pdf
+ Title: Reversible Vision Transformers
+ README: configs/revvit/README.md
+ Code:
+ Version: v1.0.0rc5
+ URL: https://github.com/open-mmlab/mmpretrain/blob/1.0.0rc5/mmcls/models/backbones/revvit.py
+
+Models:
+ - Name: revvit-small_3rdparty_in1k
+ Metadata:
+ FLOPs: 4583427072
+ Parameters: 22435432
+ In Collection: RevViT
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 79.87
+ Top 5 Accuracy: 94.90
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/revvit/revvit-small_3rdparty_in1k_20221213-a3a34f5c.pth
+ Config: configs/revvit/revvit-small_8xb256_in1k.py
+ Converted From:
+ Weights: https://dl.fbaipublicfiles.com/pyslowfast/rev/REV_VIT_S.pyth
+ Code: https://github.com/facebookresearch/SlowFast
+ - Name: revvit-base_3rdparty_in1k
+ Metadata:
+ FLOPs: 17490450432
+ Parameters: 87337192
+ In Collection: RevViT
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 81.81
+ Top 5 Accuracy: 95.56
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/revvit/revvit-base_3rdparty_in1k_20221213-87a7b0a5.pth
+ Config: configs/revvit/revvit-base_8xb256_in1k.py
+ Converted From:
+ Weights: https://dl.fbaipublicfiles.com/pyslowfast/rev/REV_VIT_B.pyth
+ Code: https://github.com/facebookresearch/SlowFast
diff --git a/configs/revvit/revvit-base_8xb256_in1k.py b/configs/revvit/revvit-base_8xb256_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..e4fde5c9487fb675b75c824608f88ba96f27e9aa
--- /dev/null
+++ b/configs/revvit/revvit-base_8xb256_in1k.py
@@ -0,0 +1,6 @@
+_base_ = [
+ '../_base_/models/revvit/revvit-base.py',
+ '../_base_/datasets/imagenet_bs128_revvit_224.py',
+ '../_base_/schedules/imagenet_bs1024_adamw_revvit.py',
+ '../_base_/default_runtime.py'
+]
diff --git a/configs/revvit/revvit-small_8xb256_in1k.py b/configs/revvit/revvit-small_8xb256_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..ec3904a3da8164f7f69c61e49d9dfee217a6b99b
--- /dev/null
+++ b/configs/revvit/revvit-small_8xb256_in1k.py
@@ -0,0 +1,6 @@
+_base_ = [
+ '../_base_/models/revvit/revvit-small.py',
+ '../_base_/datasets/imagenet_bs128_revvit_224.py',
+ '../_base_/schedules/imagenet_bs1024_adamw_revvit.py',
+ '../_base_/default_runtime.py'
+]
diff --git a/configs/riformer/README.md b/configs/riformer/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..6be694d1bf72fd7ba5e5bac0c99d33b9338e0893
--- /dev/null
+++ b/configs/riformer/README.md
@@ -0,0 +1,181 @@
+# RIFormer
+
+> [RIFormer: Keep Your Vision Backbone Effective But Removing Token Mixer](https://arxiv.org/abs/2304.05659)
+
+
+
+## Introduction
+
+RIFormer keeps a vision backbone effective while removing the token mixers from its basic building blocks. It shares nearly the same macro and micro design as MetaFormer, but safely removes all token mixers. Equipped with the proposed optimization strategy, this extremely simple vision backbone achieves encouraging performance with high efficiency during inference: the quantitative results show that it outperforms many prevailing backbones with faster inference speed on ImageNet-1K.
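+
+The inference-time simplification relies on re-parameterization: a training-time per-channel affine that stands in for the token mixer can be folded into the preceding normalization layer, so the extra operator disappears at deploy time. The snippet below is only a hedged sketch of this folding idea with illustrative names; it is not the exact RIFormer block nor the repo's `switch_to_deploy` code.
+
+```python
+import torch
+import torch.nn as nn
+
+torch.manual_seed(0)
+dim = 64
+
+# Training-time form (sketch): a normalization layer followed by a
+# per-channel affine that stands in for the removed token mixer.
+norm = nn.LayerNorm(dim)
+affine = nn.Linear(dim, dim, bias=True)
+
+x = torch.randn(2, 16, dim)
+with torch.no_grad():
+    affine.weight.copy_(torch.diag(torch.randn(dim)))  # per-channel scale only
+    y_train = affine(norm(x))
+
+    # Deploy-time form: fold the affine into the normalization parameters,
+    # so the extra operator disappears and the block is token-mixer free.
+    scale = torch.diagonal(affine.weight)
+    fused = nn.LayerNorm(dim)
+    fused.weight.copy_(norm.weight * scale)
+    fused.bias.copy_(norm.bias * scale + affine.bias)
+    y_deploy = fused(x)
+
+print(torch.allclose(y_train, y_deploy, atol=1e-5))  # True
+```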
+
+
+

+
+
+## Abstract
+
+
+
+
+
+This paper studies how to keep a vision backbone effective while removing token mixers in its basic building blocks. Token mixers, as self-attention for vision transformers (ViTs), are intended to perform information communication between different spatial tokens but suffer from considerable computational cost and latency. However, directly removing them will lead to an incomplete model structure prior, and thus brings a significant accuracy drop. To this end, we first develop an RepIdentityFormer base on the re-parameterizing idea, to study the token mixer free model architecture. And we then explore the improved learning paradigm to break the limitation of simple token mixer free backbone, and summarize the empirical practice into 5 guidelines. Equipped with the proposed optimization strategy, we are able to build an extremely simple vision backbone with encouraging performance, while enjoying the high efficiency during inference. Extensive experiments and ablative analysis also demonstrate that the inductive bias of network architecture, can be incorporated into simple network structure with appropriate optimization strategy. We hope this work can serve as a starting point for the exploration of optimization-driven efficient network design.
+
+
+
+
+## How to use
+
+The provided checkpoints are all `training-time` models. Use the reparameterization tool or the `switch_to_deploy` interface to convert them to the more efficient `inference-time` architecture, which has not only fewer parameters but also fewer computations.
+
+
+
+**Predict image**
+
+Use the `model.backbone.switch_to_deploy()` interface to switch RIFormer models into inference mode.
+
+```python
+>>> import torch
+>>> from mmpretrain import get_model, inference_model
+>>>
+>>> model = get_model("riformer-s12_in1k", pretrained=True)
+>>> results = inference_model(model, 'demo/demo.JPEG')
+>>> print( (results['pred_class'], results['pred_score']) )
+('sea snake', 0.7827484011650085)
+>>>
+>>> # switch to deploy mode
+>>> model.backbone.switch_to_deploy()
+>>> results = inference_model(model, 'demo/demo.JPEG')
+>>> print( (results['pred_class'], results['pred_score']) )
+('sea snake', 0.7827480435371399)
+```
+
+**Use the model**
+
+```python
+>>> import torch
+>>> from mmpretrain import get_model
+>>>
+>>> model = get_model("riformer-s12_in1k", pretrained=True)
+>>> model.eval()
+>>> inputs = torch.rand(1, 3, 224, 224).to(model.data_preprocessor.device)
+>>> # To get classification scores.
+>>> out = model(inputs)
+>>> print(out.shape)
+torch.Size([1, 1000])
+>>> # To extract features.
+>>> outs = model.extract_feat(inputs)
+>>> print(outs[0].shape)
+torch.Size([1, 512])
+>>>
+>>> # switch to deploy mode
+>>> model.backbone.switch_to_deploy()
+>>> out_deploy = model(inputs)
+>>> print(out_deploy.shape)
+torch.Size([1, 1000])
+>>> assert torch.allclose(out, out_deploy, rtol=1e-4, atol=1e-5) # pass without error
+```
+
+**Test Command**
+
+Place the ImageNet dataset in the `data/imagenet/` directory, or prepare your dataset according to the [docs](https://mmpretrain.readthedocs.io/en/latest/user_guides/dataset_prepare.html#prepare-dataset).
+
+*224×224*
+
+Download Checkpoint:
+
+```shell
+wget https://download.openmmlab.com/mmclassification/v1/riformer/riformer-s12_32xb128_in1k_20230406-6741ce71.pth
+```
+
+Test with the unfused model:
+
+```shell
+python tools/test.py configs/riformer/riformer-s12_8xb128_in1k.py riformer-s12_32xb128_in1k_20230406-6741ce71.pth
+```
+
+Reparameterize checkpoint:
+
+```shell
+python tools/model_converters/reparameterize_model.py configs/riformer/riformer-s12_8xb128_in1k.py riformer-s12_32xb128_in1k_20230406-6741ce71.pth riformer-s12_deploy.pth
+```
+
+Test with the fused model:
+
+```shell
+python tools/test.py configs/riformer/deploy/riformer-s12-deploy_8xb128_in1k.py riformer-s12_deploy.pth
+```
+
+
+
+For more configurable parameters, please refer to the [API](https://mmpretrain.readthedocs.io/en/latest/api/generated/mmpretrain.models.backbones.RIFormer.html#mmpretrain.models.backbones.RIFormer).
+
+
+
+**How to use the reparameterization tool**
+
+
+
+Use the provided tool to reparameterize the given model and save the checkpoint:
+
+```bash
+python tools/model_converters/reparameterize_model.py ${CFG_PATH} ${SRC_CKPT_PATH} ${TARGET_CKPT_PATH}
+```
+
+`${CFG_PATH}` is the config file path, `${SRC_CKPT_PATH}` is the source checkpoint file path, and `${TARGET_CKPT_PATH}` is the target deploy weight file path.
+
+For example:
+
+```shell
+# download the weight
+wget https://download.openmmlab.com/mmclassification/v1/riformer/riformer-s12_32xb128_in1k_20230406-6741ce71.pth
+
+# reparameterize unfused weight to fused weight
+python tools/model_converters/reparameterize_model.py configs/riformer/riformer-s12_8xb128_in1k.py riformer-s12_32xb128_in1k_20230406-6741ce71.pth riformer-s12_deploy.pth
+```
+
+To use reparameterized weights, you can use the deploy model config file such as the [s12_deploy example](./deploy/riformer-s12-deploy_8xb128_in1k.py):
+
+```text
+# in riformer-s12-deploy_8xb128_in1k.py
+_base_ = '../riformer-s12_8xb128_in1k.py' # basic s12 config
+
+model = dict(backbone=dict(deploy=True)) # switch model into deploy mode
+```
+
+```shell
+python tools/test.py configs/riformer/deploy/riformer-s12-deploy_8xb128_in1k.py riformer-s12_deploy.pth
+```
+
+
+
+
+
+## Results and models
+
+### ImageNet-1k
+
+| Model | Resolution | Params (M) | Flops (G) | Top-1 (%) | Top-5 (%) | Config | Download |
+| :-------------------: | :--------: | :-------: | :------: | :-------: | :-------: | :-------------------------------------------: | :---------------------------------------------------------------------------------------: |
+| riformer-s12_in1k | 224x224 | 11.92 | 1.82 | 76.90 | 93.06 | [config](./riformer-s12_8xb128_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v1/riformer/riformer-s12_32xb128_in1k_20230406-6741ce71.pth) |
+| riformer-s24_in1k | 224x224 | 21.39 | 3.41 | 80.28 | 94.80 | [config](./riformer-s24_8xb128_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v1/riformer/riformer-s24_32xb128_in1k_20230406-fdab072a.pth) |
+| riformer-s36_in1k | 224x224 | 30.86 | 5.00 | 81.29 | 95.41 | [config](./riformer-s36_8xb128_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v1/riformer/riformer-s36_32xb128_in1k_20230406-fdfcd3b0.pth) |
+| riformer-m36_in1k | 224x224 | 56.17 | 8.80 | 82.57 | 95.99 | [config](./riformer-m36_8xb128_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v1/riformer/riformer-m36_32xb128_in1k_20230406-2fcb9d9b.pth) |
+| riformer-m48_in1k | 224x224 | 73.47 | 11.59 | 82.75 | 96.11 | [config](./riformer-m48_8xb64_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v1/riformer/riformer-m48_32xb128_in1k_20230406-2b9d1abf.pth) |
+| riformer-s12_384_in1k | 384x384 | 11.92 | 5.36 | 78.29 | 93.93 | [config](./riformer-s12_8xb128_in1k-384px.py) | [model](https://download.openmmlab.com/mmclassification/v1/riformer/riformer-s12_32xb128_in1k-384px_20230406-145eda4c.pth) |
+| riformer-s24_384_in1k | 384x384 | 21.39 | 10.03 | 81.36 | 95.40 | [config](./riformer-s24_8xb128_in1k-384px.py) | [model](https://download.openmmlab.com/mmclassification/v1/riformer/riformer-s24_32xb128_in1k-384px_20230406-bafae7ab.pth) |
+| riformer-s36_384_in1k | 384x384 | 30.86 | 14.70 | 82.22 | 95.95 | [config](./riformer-s36_8xb64_in1k-384px.py) | [model](https://download.openmmlab.com/mmclassification/v1/riformer/riformer-s36_32xb128_in1k-384px_20230406-017ed3c4.pth) |
+| riformer-m36_384_in1k | 384x384 | 56.17 | 25.87 | 83.39 | 96.40 | [config](./riformer-m36_8xb64_in1k-384px.py) | [model](https://download.openmmlab.com/mmclassification/v1/riformer/riformer-m36_32xb128_in1k-384px_20230406-66a6f764.pth) |
+| riformer-m48_384_in1k | 384x384 | 73.47 | 34.06 | 83.70 | 96.60 | [config](./riformer-m48_8xb64_in1k-384px.py) | [model](https://download.openmmlab.com/mmclassification/v1/riformer/riformer-m48_32xb128_in1k-384px_20230406-2e874826.pth) |
+
+The config files of these models are only for inference.
+
+## Citation
+
+```bibtex
+@inproceedings{wang2023riformer,
+ title={RIFormer: Keep Your Vision Backbone Effective But Removing Token Mixer},
+ author={Wang, Jiahao and Zhang, Songyang and Liu, Yong and Wu, Taiqiang and Yang, Yujiu and Liu, Xihui and Chen, Kai and Luo, Ping and Lin, Dahua},
+ booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
+ year={2023}
+}
+```
diff --git a/configs/riformer/deploy/riformer-m36-deploy_8xb128_in1k.py b/configs/riformer/deploy/riformer-m36-deploy_8xb128_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..fcec41c810849d20c080faa1a710692e4b2bb9a0
--- /dev/null
+++ b/configs/riformer/deploy/riformer-m36-deploy_8xb128_in1k.py
@@ -0,0 +1,3 @@
+_base_ = '../riformer-m36_8xb128_in1k.py'
+
+model = dict(backbone=dict(deploy=True))
diff --git a/configs/riformer/deploy/riformer-m36-deploy_8xb64_in1k-384px.py b/configs/riformer/deploy/riformer-m36-deploy_8xb64_in1k-384px.py
new file mode 100644
index 0000000000000000000000000000000000000000..e18f836f89d9057b1d8a1b6d31cd83d6bdca6b3a
--- /dev/null
+++ b/configs/riformer/deploy/riformer-m36-deploy_8xb64_in1k-384px.py
@@ -0,0 +1,3 @@
+_base_ = '../riformer-m36_8xb64_in1k-384px.py'
+
+model = dict(backbone=dict(deploy=True))
diff --git a/configs/riformer/deploy/riformer-m48-deploy_8xb64_in1k-384px.py b/configs/riformer/deploy/riformer-m48-deploy_8xb64_in1k-384px.py
new file mode 100644
index 0000000000000000000000000000000000000000..0ab33534e271ccad60a9f6d896fa15238601a4e0
--- /dev/null
+++ b/configs/riformer/deploy/riformer-m48-deploy_8xb64_in1k-384px.py
@@ -0,0 +1,3 @@
+_base_ = '../riformer-m48_8xb64_in1k-384px.py'
+
+model = dict(backbone=dict(deploy=True))
diff --git a/configs/riformer/deploy/riformer-m48-deploy_8xb64_in1k.py b/configs/riformer/deploy/riformer-m48-deploy_8xb64_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..e32ad328f893aaa0da1a4072315a91f514a594ce
--- /dev/null
+++ b/configs/riformer/deploy/riformer-m48-deploy_8xb64_in1k.py
@@ -0,0 +1,3 @@
+_base_ = '../riformer-m48_8xb64_in1k.py'
+
+model = dict(backbone=dict(deploy=True))
diff --git a/configs/riformer/deploy/riformer-s12-deploy_8xb128_in1k-384px.py b/configs/riformer/deploy/riformer-s12-deploy_8xb128_in1k-384px.py
new file mode 100644
index 0000000000000000000000000000000000000000..ffbb4be31d76716432ff283d9d7c2d77370ddbb0
--- /dev/null
+++ b/configs/riformer/deploy/riformer-s12-deploy_8xb128_in1k-384px.py
@@ -0,0 +1,3 @@
+_base_ = '../riformer-s12_8xb128_in1k-384px.py'
+
+model = dict(backbone=dict(deploy=True))
diff --git a/configs/riformer/deploy/riformer-s12-deploy_8xb128_in1k.py b/configs/riformer/deploy/riformer-s12-deploy_8xb128_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..70fd8b74342e07ec2e3b4299364681ffbea5ec25
--- /dev/null
+++ b/configs/riformer/deploy/riformer-s12-deploy_8xb128_in1k.py
@@ -0,0 +1,3 @@
+_base_ = '../riformer-s12_8xb128_in1k.py'
+
+model = dict(backbone=dict(deploy=True))
diff --git a/configs/riformer/deploy/riformer-s24-deploy_8xb128_in1k-384px.py b/configs/riformer/deploy/riformer-s24-deploy_8xb128_in1k-384px.py
new file mode 100644
index 0000000000000000000000000000000000000000..7d05e5c1a14afe10e05ae648e47c16d53220f226
--- /dev/null
+++ b/configs/riformer/deploy/riformer-s24-deploy_8xb128_in1k-384px.py
@@ -0,0 +1,3 @@
+_base_ = '../riformer-s24_8xb128_in1k-384px.py'
+
+model = dict(backbone=dict(deploy=True))
diff --git a/configs/riformer/deploy/riformer-s24-deploy_8xb128_in1k.py b/configs/riformer/deploy/riformer-s24-deploy_8xb128_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..47f83a08f4f2c6fa6ffc7105265b41c12e30fd2e
--- /dev/null
+++ b/configs/riformer/deploy/riformer-s24-deploy_8xb128_in1k.py
@@ -0,0 +1,3 @@
+_base_ = '../riformer-s24_8xb128_in1k.py'
+
+model = dict(backbone=dict(deploy=True))
diff --git a/configs/riformer/deploy/riformer-s36-deploy_8xb128_in1k.py b/configs/riformer/deploy/riformer-s36-deploy_8xb128_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..2c03bb15106829f22ba959d2a84d0a92ceba4dac
--- /dev/null
+++ b/configs/riformer/deploy/riformer-s36-deploy_8xb128_in1k.py
@@ -0,0 +1,3 @@
+_base_ = '../riformer-s36_8xb128_in1k.py'
+
+model = dict(backbone=dict(deploy=True))
diff --git a/configs/riformer/deploy/riformer-s36-deploy_8xb64_in1k-384px.py b/configs/riformer/deploy/riformer-s36-deploy_8xb64_in1k-384px.py
new file mode 100644
index 0000000000000000000000000000000000000000..67b17ee5173e5bef7d2ecdf6d92e09cbb48db482
--- /dev/null
+++ b/configs/riformer/deploy/riformer-s36-deploy_8xb64_in1k-384px.py
@@ -0,0 +1,3 @@
+_base_ = '../riformer-s36_8xb64_in1k-384px.py'
+
+model = dict(backbone=dict(deploy=True))
diff --git a/configs/riformer/metafile.yml b/configs/riformer/metafile.yml
new file mode 100644
index 0000000000000000000000000000000000000000..5f3e2ec8773d26cde570bb874d2a45a73a49bc7b
--- /dev/null
+++ b/configs/riformer/metafile.yml
@@ -0,0 +1,152 @@
+Collections:
+ - Name: RIFormer
+ Metadata:
+ Training Data: ImageNet-1k
+ Training Resources: 8x A100 GPUs
+ Architecture:
+ - Affine
+ - 1x1 Convolution
+ - LayerScale
+ Paper:
+ URL: https://arxiv.org/abs/2304.05659
+ Title: "RIFormer: Keep Your Vision Backbone Effective But Removing Token Mixer"
+ README: configs/riformer/README.md
+ Code:
+ Version: v1.0.0rc7
+ URL: null
+
+Models:
+ - Name: riformer-s12_in1k
+ Metadata:
+ FLOPs: 1822000000
+ Parameters: 11915000
+ In Collection: RIFormer
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 76.90
+ Top 5 Accuracy: 93.06
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v1/riformer/riformer-s12_32xb128_in1k_20230406-6741ce71.pth
+ Config: configs/riformer/riformer-s12_8xb128_in1k.py
+ - Name: riformer-s24_in1k
+ Metadata:
+ Training Data: ImageNet-1k
+ FLOPs: 3412000000
+ Parameters: 21389000
+ In Collection: RIFormer
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 80.28
+ Top 5 Accuracy: 94.80
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v1/riformer/riformer-s24_32xb128_in1k_20230406-fdab072a.pth
+ Config: configs/riformer/riformer-s24_8xb128_in1k.py
+ - Name: riformer-s36_in1k
+ Metadata:
+ FLOPs: 5003000000
+ Parameters: 30863000
+ In Collection: RIFormer
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 81.29
+ Top 5 Accuracy: 95.41
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v1/riformer/riformer-s36_32xb128_in1k_20230406-fdfcd3b0.pth
+ Config: configs/riformer/riformer-s36_8xb128_in1k.py
+ - Name: riformer-m36_in1k
+ Metadata:
+ Training Data: ImageNet-1k
+ FLOPs: 8801000000
+ Parameters: 56173000
+ In Collection: RIFormer
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 82.57
+ Top 5 Accuracy: 95.99
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v1/riformer/riformer-m36_32xb128_in1k_20230406-2fcb9d9b.pth
+ Config: configs/riformer/riformer-m36_8xb128_in1k.py
+ - Name: riformer-m48_in1k
+ Metadata:
+ FLOPs: 11590000000
+ Parameters: 73473000
+ In Collection: RIFormer
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 82.75
+ Top 5 Accuracy: 96.11
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v1/riformer/riformer-m48_32xb128_in1k_20230406-2b9d1abf.pth
+ Config: configs/riformer/riformer-m48_8xb64_in1k.py
+ - Name: riformer-s12_in1k-384
+ Metadata:
+ FLOPs: 5355000000
+ Parameters: 11915000
+ In Collection: RIFormer
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 78.29
+ Top 5 Accuracy: 93.93
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v1/riformer/riformer-s12_32xb128_in1k-384px_20230406-145eda4c.pth
+ Config: configs/riformer/riformer-s12_8xb128_in1k-384px.py
+ - Name: riformer-s24_in1k-384
+ Metadata:
+ Training Data: ImageNet-1k
+ FLOPs: 10028000000
+ Parameters: 21389000
+ In Collection: RIFormer
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 81.36
+ Top 5 Accuracy: 95.40
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v1/riformer/riformer-s24_32xb128_in1k-384px_20230406-bafae7ab.pth
+ Config: configs/riformer/riformer-s24_8xb128_in1k-384px.py
+ - Name: riformer-s36_in1k-384
+ Metadata:
+ FLOPs: 14702000000
+ Parameters: 30863000
+ In Collection: RIFormer
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 82.22
+ Top 5 Accuracy: 95.95
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v1/riformer/riformer-s36_32xb128_in1k-384px_20230406-017ed3c4.pth
+ Config: configs/riformer/riformer-s36_8xb64_in1k-384px.py
+ - Name: riformer-m36_in1k-384
+ Metadata:
+ Training Data: ImageNet-1k
+ FLOPs: 25865000000
+ Parameters: 56173000
+ In Collection: RIFormer
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 83.39
+ Top 5 Accuracy: 96.40
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v1/riformer/riformer-m36_32xb128_in1k-384px_20230406-66a6f764.pth
+ Config: configs/riformer/riformer-m36_8xb64_in1k-384px.py
+ - Name: riformer-m48_in1k-384
+ Metadata:
+ FLOPs: 34060000000
+ Parameters: 73473000
+ In Collection: RIFormer
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 83.70
+ Top 5 Accuracy: 96.60
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v1/riformer/riformer-m48_32xb128_in1k-384px_20230406-2e874826.pth
+ Config: configs/riformer/riformer-m48_8xb64_in1k-384px.py
diff --git a/configs/riformer/riformer-m36_8xb128_in1k.py b/configs/riformer/riformer-m36_8xb128_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..30e93aa83d0f5c0b379367e2dc9b7a7d038108b4
--- /dev/null
+++ b/configs/riformer/riformer-m36_8xb128_in1k.py
@@ -0,0 +1,39 @@
+_base_ = [
+ '../_base_/datasets/imagenet_bs128_poolformer_medium_224.py',
+ '../_base_/schedules/imagenet_bs1024_adamw_swin.py',
+ '../_base_/default_runtime.py',
+]
+
+# Model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='RIFormer',
+ arch='m36',
+ drop_path_rate=0.1,
+ init_cfg=[
+ dict(
+ type='TruncNormal',
+ layer=['Conv2d', 'Linear'],
+ std=.02,
+ bias=0.),
+ dict(type='Constant', layer=['GroupNorm'], val=1., bias=0.),
+ ]),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=768,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ ))
+
+# schedule settings
+optim_wrapper = dict(
+ optimizer=dict(lr=4e-3),
+ clip_grad=dict(max_norm=5.0),
+)
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR
+# based on the actual training batch size.
+# base_batch_size = (32 GPUs) x (128 samples per GPU)
+auto_scale_lr = dict(base_batch_size=4096)
diff --git a/configs/riformer/riformer-m36_8xb64_in1k-384px.py b/configs/riformer/riformer-m36_8xb64_in1k-384px.py
new file mode 100644
index 0000000000000000000000000000000000000000..57f687cd50b60d99978dec7baeec4bf6a67e5de5
--- /dev/null
+++ b/configs/riformer/riformer-m36_8xb64_in1k-384px.py
@@ -0,0 +1,39 @@
+_base_ = [
+ '../_base_/datasets/imagenet_bs128_riformer_medium_384.py',
+ '../_base_/schedules/imagenet_bs1024_adamw_swin.py',
+ '../_base_/default_runtime.py',
+]
+
+# Model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='RIFormer',
+ arch='m36',
+ drop_path_rate=0.1,
+ init_cfg=[
+ dict(
+ type='TruncNormal',
+ layer=['Conv2d', 'Linear'],
+ std=.02,
+ bias=0.),
+ dict(type='Constant', layer=['GroupNorm'], val=1., bias=0.),
+ ]),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=768,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ ))
+
+# schedule settings
+optim_wrapper = dict(
+ optimizer=dict(lr=4e-3),
+ clip_grad=dict(max_norm=5.0),
+)
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR
+# based on the actual training batch size.
+# base_batch_size = (32 GPUs) x (128 samples per GPU)
+auto_scale_lr = dict(base_batch_size=4096)
diff --git a/configs/riformer/riformer-m48_8xb64_in1k-384px.py b/configs/riformer/riformer-m48_8xb64_in1k-384px.py
new file mode 100644
index 0000000000000000000000000000000000000000..ef6f1964624f76e204a5d257ddee2410f21ab456
--- /dev/null
+++ b/configs/riformer/riformer-m48_8xb64_in1k-384px.py
@@ -0,0 +1,39 @@
+_base_ = [
+ '../_base_/datasets/imagenet_bs128_riformer_medium_384.py',
+ '../_base_/schedules/imagenet_bs1024_adamw_swin.py',
+ '../_base_/default_runtime.py',
+]
+
+# Model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='RIFormer',
+ arch='m48',
+ drop_path_rate=0.1,
+ init_cfg=[
+ dict(
+ type='TruncNormal',
+ layer=['Conv2d', 'Linear'],
+ std=.02,
+ bias=0.),
+ dict(type='Constant', layer=['GroupNorm'], val=1., bias=0.),
+ ]),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=768,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ ))
+
+# schedule settings
+optim_wrapper = dict(
+ optimizer=dict(lr=4e-3),
+ clip_grad=dict(max_norm=5.0),
+)
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR
+# based on the actual training batch size.
+# base_batch_size = (32 GPUs) x (128 samples per GPU)
+auto_scale_lr = dict(base_batch_size=4096)
diff --git a/configs/riformer/riformer-m48_8xb64_in1k.py b/configs/riformer/riformer-m48_8xb64_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..9dc5c3e291f136d40633e05c9c2931d140c532bc
--- /dev/null
+++ b/configs/riformer/riformer-m48_8xb64_in1k.py
@@ -0,0 +1,39 @@
+_base_ = [
+ '../_base_/datasets/imagenet_bs128_poolformer_medium_224.py',
+ '../_base_/schedules/imagenet_bs1024_adamw_swin.py',
+ '../_base_/default_runtime.py',
+]
+
+# Model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='RIFormer',
+ arch='m48',
+ drop_path_rate=0.1,
+ init_cfg=[
+ dict(
+ type='TruncNormal',
+ layer=['Conv2d', 'Linear'],
+ std=.02,
+ bias=0.),
+ dict(type='Constant', layer=['GroupNorm'], val=1., bias=0.),
+ ]),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=768,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ ))
+
+# schedule settings
+optim_wrapper = dict(
+ optimizer=dict(lr=4e-3),
+ clip_grad=dict(max_norm=5.0),
+)
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR
+# based on the actual training batch size.
+# base_batch_size = (32 GPUs) x (128 samples per GPU)
+auto_scale_lr = dict(base_batch_size=4096)
diff --git a/configs/riformer/riformer-s12_8xb128_in1k-384px.py b/configs/riformer/riformer-s12_8xb128_in1k-384px.py
new file mode 100644
index 0000000000000000000000000000000000000000..6d19dae07c811aeb0ca5af3cb92e57903405e49b
--- /dev/null
+++ b/configs/riformer/riformer-s12_8xb128_in1k-384px.py
@@ -0,0 +1,39 @@
+_base_ = [
+ '../_base_/datasets/imagenet_bs128_riformer_small_384.py',
+ '../_base_/schedules/imagenet_bs1024_adamw_swin.py',
+ '../_base_/default_runtime.py',
+]
+
+# Model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='RIFormer',
+ arch='s12',
+ drop_path_rate=0.1,
+ init_cfg=[
+ dict(
+ type='TruncNormal',
+ layer=['Conv2d', 'Linear'],
+ std=.02,
+ bias=0.),
+ dict(type='Constant', layer=['GroupNorm'], val=1., bias=0.),
+ ]),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=512,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ ))
+
+# schedule settings
+optim_wrapper = dict(
+ optimizer=dict(lr=4e-3),
+ clip_grad=dict(max_norm=5.0),
+)
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR
+# based on the actual training batch size.
+# base_batch_size = (32 GPUs) x (128 samples per GPU)
+auto_scale_lr = dict(base_batch_size=4096)
diff --git a/configs/riformer/riformer-s12_8xb128_in1k.py b/configs/riformer/riformer-s12_8xb128_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..e85f8fb883de19f1021b8148fc680711149b5a9d
--- /dev/null
+++ b/configs/riformer/riformer-s12_8xb128_in1k.py
@@ -0,0 +1,39 @@
+_base_ = [
+ '../_base_/datasets/imagenet_bs128_poolformer_small_224.py',
+ '../_base_/schedules/imagenet_bs1024_adamw_swin.py',
+ '../_base_/default_runtime.py',
+]
+
+# Model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='RIFormer',
+ arch='s12',
+ drop_path_rate=0.1,
+ init_cfg=[
+ dict(
+ type='TruncNormal',
+ layer=['Conv2d', 'Linear'],
+ std=.02,
+ bias=0.),
+ dict(type='Constant', layer=['GroupNorm'], val=1., bias=0.),
+ ]),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=512,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ ))
+
+# schedule settings
+optim_wrapper = dict(
+ optimizer=dict(lr=4e-3),
+ clip_grad=dict(max_norm=5.0),
+)
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR
+# based on the actual training batch size.
+# base_batch_size = (32 GPUs) x (128 samples per GPU)
+auto_scale_lr = dict(base_batch_size=4096)
diff --git a/configs/riformer/riformer-s24_8xb128_in1k-384px.py b/configs/riformer/riformer-s24_8xb128_in1k-384px.py
new file mode 100644
index 0000000000000000000000000000000000000000..6a1ec7b57385c4910ffaebcd152296bbdee360e1
--- /dev/null
+++ b/configs/riformer/riformer-s24_8xb128_in1k-384px.py
@@ -0,0 +1,39 @@
+_base_ = [
+ '../_base_/datasets/imagenet_bs128_riformer_small_384.py',
+ '../_base_/schedules/imagenet_bs1024_adamw_swin.py',
+ '../_base_/default_runtime.py',
+]
+
+# Model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='RIFormer',
+ arch='s24',
+ drop_path_rate=0.1,
+ init_cfg=[
+ dict(
+ type='TruncNormal',
+ layer=['Conv2d', 'Linear'],
+ std=.02,
+ bias=0.),
+ dict(type='Constant', layer=['GroupNorm'], val=1., bias=0.),
+ ]),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=512,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ ))
+
+# schedule settings
+optim_wrapper = dict(
+ optimizer=dict(lr=4e-3),
+ clip_grad=dict(max_norm=5.0),
+)
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR
+# based on the actual training batch size.
+# base_batch_size = (32 GPUs) x (128 samples per GPU)
+auto_scale_lr = dict(base_batch_size=4096)
diff --git a/configs/riformer/riformer-s24_8xb128_in1k.py b/configs/riformer/riformer-s24_8xb128_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..560cddcf8829703d2f1e9aaf4856e947b762b49a
--- /dev/null
+++ b/configs/riformer/riformer-s24_8xb128_in1k.py
@@ -0,0 +1,39 @@
+_base_ = [
+ '../_base_/datasets/imagenet_bs128_poolformer_small_224.py',
+ '../_base_/schedules/imagenet_bs1024_adamw_swin.py',
+ '../_base_/default_runtime.py',
+]
+
+# Model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='RIFormer',
+ arch='s24',
+ drop_path_rate=0.1,
+ init_cfg=[
+ dict(
+ type='TruncNormal',
+ layer=['Conv2d', 'Linear'],
+ std=.02,
+ bias=0.),
+ dict(type='Constant', layer=['GroupNorm'], val=1., bias=0.),
+ ]),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=512,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ ))
+
+# schedule settings
+optim_wrapper = dict(
+ optimizer=dict(lr=4e-3),
+ clip_grad=dict(max_norm=5.0),
+)
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR
+# based on the actual training batch size.
+# base_batch_size = (32 GPUs) x (128 samples per GPU)
+auto_scale_lr = dict(base_batch_size=4096)
diff --git a/configs/riformer/riformer-s36_8xb128_in1k.py b/configs/riformer/riformer-s36_8xb128_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..28511307a294031301cb425d513844780d199606
--- /dev/null
+++ b/configs/riformer/riformer-s36_8xb128_in1k.py
@@ -0,0 +1,39 @@
+_base_ = [
+ '../_base_/datasets/imagenet_bs128_poolformer_small_224.py',
+ '../_base_/schedules/imagenet_bs1024_adamw_swin.py',
+ '../_base_/default_runtime.py',
+]
+
+# Model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='RIFormer',
+ arch='s36',
+ drop_path_rate=0.1,
+ init_cfg=[
+ dict(
+ type='TruncNormal',
+ layer=['Conv2d', 'Linear'],
+ std=.02,
+ bias=0.),
+ dict(type='Constant', layer=['GroupNorm'], val=1., bias=0.),
+ ]),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=512,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ ))
+
+# schedule settings
+optim_wrapper = dict(
+ optimizer=dict(lr=4e-3),
+ clip_grad=dict(max_norm=5.0),
+)
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR
+# based on the actual training batch size.
+# base_batch_size = (32 GPUs) x (128 samples per GPU)
+auto_scale_lr = dict(base_batch_size=4096)
diff --git a/configs/riformer/riformer-s36_8xb64_in1k-384px.py b/configs/riformer/riformer-s36_8xb64_in1k-384px.py
new file mode 100644
index 0000000000000000000000000000000000000000..b3077357051632c81426e5d94322558412430373
--- /dev/null
+++ b/configs/riformer/riformer-s36_8xb64_in1k-384px.py
@@ -0,0 +1,39 @@
+_base_ = [
+ '../_base_/datasets/imagenet_bs128_riformer_small_384.py',
+ '../_base_/schedules/imagenet_bs1024_adamw_swin.py',
+ '../_base_/default_runtime.py',
+]
+
+# Model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='RIFormer',
+ arch='s36',
+ drop_path_rate=0.1,
+ init_cfg=[
+ dict(
+ type='TruncNormal',
+ layer=['Conv2d', 'Linear'],
+ std=.02,
+ bias=0.),
+ dict(type='Constant', layer=['GroupNorm'], val=1., bias=0.),
+ ]),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=512,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ ))
+
+# schedule settings
+optim_wrapper = dict(
+ optimizer=dict(lr=4e-3),
+ clip_grad=dict(max_norm=5.0),
+)
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR
+# based on the actual training batch size.
+# base_batch_size = (32 GPUs) x (128 samples per GPU)
+auto_scale_lr = dict(base_batch_size=4096)
diff --git a/configs/sam/README.md b/configs/sam/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..1a5668a3d0bff5aacac10f26a41714afe3622c78
--- /dev/null
+++ b/configs/sam/README.md
@@ -0,0 +1,57 @@
+# SAM
+
+> [Segment Anything](https://arxiv.org/abs/2304.02643)
+
+
+
+## Abstract
+
+We introduce the Segment Anything (SA) project: a new task, model, and dataset for image segmentation. Using our efficient model in a data collection loop, we built the largest segmentation dataset to date (by far), with over 1 billion masks on 11M licensed and privacy respecting images. The model is designed and trained to be promptable, so it can transfer zero-shot to new image distributions and tasks. We evaluate its capabilities on numerous tasks and find that its zero-shot performance is impressive – often competitive with or even superior to prior fully supervised results. We are releasing the Segment Anything Model (SAM) and corresponding dataset (SA-1B) of 1B masks and 11M images at https://segment-anything.com to foster research into foundation models for computer vision.
+
+
+

+
+
+## How to use it?
+
+
+
+**Use the model**
+
+```python
+import torch
+from mmpretrain import get_model
+
+model = get_model('vit-base-p16_sam-pre_3rdparty_sa1b-1024px', pretrained=True)
+inputs = torch.rand(1, 3, 1024, 1024)
+out = model(inputs)
+print(type(out))
+# To extract features.
+feats = model.extract_feat(inputs)
+print(type(feats))
+```
+
+
+
+## Models and results
+
+### Pretrained models
+
+| Model | Params (M) | Flops (G) | Config | Download |
+| :--------------------------------------------- | :--------: | :-------: | :-------------------------------------: | :----------------------------------------------------------------------------------------------: |
+| `vit-base-p16_sam-pre_3rdparty_sa1b-1024px`\* | 89.67 | 486.00 | [config](vit-base-p16_sam_headless.py) | [model](https://download.openmmlab.com/mmclassification/v1/vit_sam/vit-base-p16_sam-pre_3rdparty_sa1b-1024px_20230411-2320f9cc.pth) |
+| `vit-large-p16_sam-pre_3rdparty_sa1b-1024px`\* | 308.00 | 1494.00 | [config](vit-large-p16_sam_headless.py) | [model](https://download.openmmlab.com/mmclassification/v1/vit_sam/vit-large-p16_sam-pre_3rdparty_sa1b-1024px_20230411-595feafd.pth) |
+| `vit-huge-p16_sam-pre_3rdparty_sa1b-1024px`\* | 637.00 | 2982.00 | [config](vit-huge-p16_sam_headless.py) | [model](https://download.openmmlab.com/mmclassification/v1/vit_sam/vit-huge-p16_sam-pre_3rdparty_sa1b-1024px_20230411-3f13c653.pth) |
+
+*Models with * are converted from the [official repo](https://github.com/facebookresearch/segment-anything/). The config files of these models are only for inference. We haven't reproduced the training results.*
+
+## Citation
+
+```bibtex
+@article{kirillov2023segany,
+ title={Segment Anything},
+ author={Kirillov, Alexander and Mintun, Eric and Ravi, Nikhila and Mao, Hanzi and Rolland, Chloe and Gustafson, Laura and Xiao, Tete and Whitehead, Spencer and Berg, Alexander C. and Lo, Wan-Yen and Doll{\'a}r, Piotr and Girshick, Ross},
+ journal={arXiv:2304.02643},
+ year={2023}
+}
+```
diff --git a/configs/sam/metafile.yml b/configs/sam/metafile.yml
new file mode 100644
index 0000000000000000000000000000000000000000..1ac65ce7715e91468e108132493ecdcbb4db277c
--- /dev/null
+++ b/configs/sam/metafile.yml
@@ -0,0 +1,61 @@
+Collections:
+ - Name: SAM
+ Metadata:
+ Architecture:
+ - Convolution
+ - Dense Connections
+ - Dropout
+ - GELU
+ - Layer Normalization
+ - Multi-Head Attention
+ - Scaled Dot-Product Attention
+ Paper:
+ Title: 'Segment Anything'
+ URL: https://arxiv.org/abs/2304.02643
+ README: configs/sam/README.md
+ Code:
+ URL: null
+ Version: null
+
+Models:
+ - Name: vit-base-p16_sam-pre_3rdparty_sa1b-1024px
+ Metadata:
+ FLOPs: 486000000000
+ Parameters: 89671000
+ Training Data:
+ - SA-1B
+ In Collection: SAM
+ Results: null
+ Weights: https://download.openmmlab.com/mmclassification/v1/vit_sam/vit-base-p16_sam-pre_3rdparty_sa1b-1024px_20230411-2320f9cc.pth
+ Config: configs/sam/vit-base-p16_sam_headless.py
+ Converted From:
+ Weights: https://dl.fbaipublicfiles.com/segment_anything/sam_vit_b_01ec64.pth
+ Code: https://github.com/facebookresearch/segment-anything/
+
+ - Name: vit-large-p16_sam-pre_3rdparty_sa1b-1024px
+ Metadata:
+ FLOPs: 1494000000000
+ Parameters: 308000000
+ Training Data:
+ - SA-1B
+ In Collection: SAM
+ Results: null
+ Weights: https://download.openmmlab.com/mmclassification/v1/vit_sam/vit-large-p16_sam-pre_3rdparty_sa1b-1024px_20230411-595feafd.pth
+ Config: configs/sam/vit-large-p16_sam_headless.py
+ Converted From:
+ Weights: https://dl.fbaipublicfiles.com/segment_anything/sam_vit_l_0b3195.pth
+ Code: https://github.com/facebookresearch/segment-anything/
+
+ - Name: vit-huge-p16_sam-pre_3rdparty_sa1b-1024px
+ Metadata:
+ FLOPs: 2982000000000
+ Parameters: 637000000
+ Training Data:
+ - SA-1B
+ In Collection: SAM
+ Results: null
+ Weights: https://download.openmmlab.com/mmclassification/v1/vit_sam/vit-huge-p16_sam-pre_3rdparty_sa1b-1024px_20230411-3f13c653.pth
+ Config: configs/sam/vit-huge-p16_sam_headless.py
+ Converted From:
+ Weights: https://dl.fbaipublicfiles.com/segment_anything/sam_vit_h_4b8939.pth
+ Code: https://github.com/facebookresearch/segment-anything/
diff --git a/configs/sam/vit-base-p16_sam_headless.py b/configs/sam/vit-base-p16_sam_headless.py
new file mode 100644
index 0000000000000000000000000000000000000000..bea26376ee932af5704fd5d232efc3cdf128e310
--- /dev/null
+++ b/configs/sam/vit-base-p16_sam_headless.py
@@ -0,0 +1,24 @@
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='ViTSAM',
+ arch='base',
+ img_size=1024,
+ patch_size=16,
+ out_channels=256,
+ use_abs_pos=True,
+ use_rel_pos=True,
+ window_size=14,
+ ),
+ neck=None,
+ head=None,
+)
+
+data_preprocessor = dict(
+ # RGB format normalization parameters
+ mean=[123.675, 116.28, 103.53],
+ std=[58.395, 57.12, 57.375],
+ # convert image from BGR to RGB
+ to_rgb=True,
+)
diff --git a/configs/sam/vit-huge-p16_sam_headless.py b/configs/sam/vit-huge-p16_sam_headless.py
new file mode 100644
index 0000000000000000000000000000000000000000..8004755bfbe7dd0e5366297f03f73494dc27c27b
--- /dev/null
+++ b/configs/sam/vit-huge-p16_sam_headless.py
@@ -0,0 +1,24 @@
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='ViTSAM',
+ arch='huge',
+ img_size=1024,
+ patch_size=16,
+ out_channels=256,
+ use_abs_pos=True,
+ use_rel_pos=True,
+ window_size=14,
+ ),
+ neck=None,
+ head=None,
+)
+
+data_preprocessor = dict(
+ # RGB format normalization parameters
+ mean=[123.675, 116.28, 103.53],
+ std=[58.395, 57.12, 57.375],
+ # convert image from BGR to RGB
+ to_rgb=True,
+)
diff --git a/configs/sam/vit-large-p16_sam_headless.py b/configs/sam/vit-large-p16_sam_headless.py
new file mode 100644
index 0000000000000000000000000000000000000000..1cebeb098205d081a4340fb4af369e2c29a20d66
--- /dev/null
+++ b/configs/sam/vit-large-p16_sam_headless.py
@@ -0,0 +1,24 @@
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='ViTSAM',
+ arch='large',
+ img_size=1024,
+ patch_size=16,
+ out_channels=256,
+ use_abs_pos=True,
+ use_rel_pos=True,
+ window_size=14,
+ ),
+ neck=None,
+ head=None,
+)
+
+data_preprocessor = dict(
+ # RGB format normalization parameters
+ mean=[123.675, 116.28, 103.53],
+ std=[58.395, 57.12, 57.375],
+ # convert image from BGR to RGB
+ to_rgb=True,
+)
diff --git a/configs/seresnet/README.md b/configs/seresnet/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..b5151ccde85112f12af2170796b169933e9a93ab
--- /dev/null
+++ b/configs/seresnet/README.md
@@ -0,0 +1,81 @@
+# SEResNet
+
+> [Squeeze-and-Excitation Networks](https://openaccess.thecvf.com/content_cvpr_2018/html/Hu_Squeeze-and-Excitation_Networks_CVPR_2018_paper.html)
+
+
+
+## Abstract
+
+The central building block of convolutional neural networks (CNNs) is the convolution operator, which enables networks to construct informative features by fusing both spatial and channel-wise information within local receptive fields at each layer. A broad range of prior research has investigated the spatial component of this relationship, seeking to strengthen the representational power of a CNN by enhancing the quality of spatial encodings throughout its feature hierarchy. In this work, we focus instead on the channel relationship and propose a novel architectural unit, which we term the "Squeeze-and-Excitation" (SE) block, that adaptively recalibrates channel-wise feature responses by explicitly modelling interdependencies between channels. We show that these blocks can be stacked together to form SENet architectures that generalise extremely effectively across different datasets. We further demonstrate that SE blocks bring significant improvements in performance for existing state-of-the-art CNNs at slight additional computational cost. Squeeze-and-Excitation Networks formed the foundation of our ILSVRC 2017 classification submission which won first place and reduced the top-5 error to 2.251%, surpassing the winning entry of 2016 by a relative improvement of ~25%.
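+
+As a concrete picture of the SE block described above, here is a minimal PyTorch sketch (illustrative only; the SE layer used by the backbones in this repo may differ in details such as the gating layers and reduction-ratio handling):
+
+```python
+import torch
+import torch.nn as nn
+
+
+class SEBlock(nn.Module):
+    """Squeeze-and-Excitation: squeeze spatial info per channel, then
+    excite (re-weight) channels with a small gating MLP."""
+
+    def __init__(self, channels, reduction=16):
+        super().__init__()
+        self.pool = nn.AdaptiveAvgPool2d(1)  # squeeze: global average pooling
+        self.fc = nn.Sequential(
+            nn.Linear(channels, channels // reduction),
+            nn.ReLU(inplace=True),
+            nn.Linear(channels // reduction, channels),
+            nn.Sigmoid(),  # per-channel gates in (0, 1)
+        )
+
+    def forward(self, x):
+        b, c, _, _ = x.shape
+        w = self.fc(self.pool(x).view(b, c)).view(b, c, 1, 1)
+        return x * w  # recalibrate channel responses
+
+
+x = torch.randn(2, 64, 56, 56)
+print(SEBlock(64)(x).shape)  # torch.Size([2, 64, 56, 56])
+```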
+
+
+

+
+
+## How to use it?
+
+
+
+**Predict image**
+
+```python
+from mmpretrain import inference_model
+
+predict = inference_model('seresnet50_8xb32_in1k', 'demo/bird.JPEG')
+print(predict['pred_class'])
+print(predict['pred_score'])
+```
+
+**Use the model**
+
+```python
+import torch
+from mmpretrain import get_model
+
+model = get_model('seresnet50_8xb32_in1k', pretrained=True)
+inputs = torch.rand(1, 3, 224, 224)
+out = model(inputs)
+print(type(out))
+# To extract features.
+feats = model.extract_feat(inputs)
+print(type(feats))
+```
+
+**Train/Test Command**
+
+Prepare your dataset according to the [docs](https://mmpretrain.readthedocs.io/en/latest/user_guides/dataset_prepare.html#prepare-dataset).
+
+Train:
+
+```shell
+python tools/train.py configs/seresnet/seresnet50_8xb32_in1k.py
+```
+
+Test:
+
+```shell
+python tools/test.py configs/seresnet/seresnet50_8xb32_in1k.py https://download.openmmlab.com/mmclassification/v0/se-resnet/se-resnet50_batch256_imagenet_20200804-ae206104.pth
+```
+
+
+
+## Models and results
+
+### Image Classification on ImageNet-1k
+
+| Model | Pretrain | Params (M) | Flops (G) | Top-1 (%) | Top-5 (%) | Config | Download |
+| :----------------------- | :----------: | :--------: | :-------: | :-------: | :-------: | :---------------------------------: | :------------------------------------------------------------------------------------------: |
+| `seresnet50_8xb32_in1k` | From scratch | 28.09 | 4.13 | 77.74 | 93.84 | [config](seresnet50_8xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/se-resnet/se-resnet50_batch256_imagenet_20200804-ae206104.pth) \| [log](https://download.openmmlab.com/mmclassification/v0/se-resnet/se-resnet50_batch256_imagenet_20200708-657b3c36.log.json) |
+| `seresnet101_8xb32_in1k` | From scratch | 49.33 | 7.86 | 78.26 | 94.07 | [config](seresnet101_8xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/se-resnet/se-resnet101_batch256_imagenet_20200804-ba5b51d4.pth) \| [log](https://download.openmmlab.com/mmclassification/v0/se-resnet/se-resnet101_batch256_imagenet_20200708-038a4d04.log.json) |
+
+## Citation
+
+```bibtex
+@inproceedings{hu2018squeeze,
+ title={Squeeze-and-excitation networks},
+ author={Hu, Jie and Shen, Li and Sun, Gang},
+ booktitle={Proceedings of the IEEE conference on computer vision and pattern recognition},
+ pages={7132--7141},
+ year={2018}
+}
+```
diff --git a/configs/seresnet/metafile.yml b/configs/seresnet/metafile.yml
new file mode 100644
index 0000000000000000000000000000000000000000..1a9f116da4c8014e91e31af5db33d7b13b151826
--- /dev/null
+++ b/configs/seresnet/metafile.yml
@@ -0,0 +1,47 @@
+Collections:
+ - Name: SEResNet
+ Metadata:
+ Training Data: ImageNet-1k
+ Training Techniques:
+ - SGD with Momentum
+ - Weight Decay
+ Training Resources: 8x V100 GPUs
+ Epochs: 140
+ Batch Size: 256
+ Architecture:
+ - ResNet
+ Paper:
+ URL: https://openaccess.thecvf.com/content_cvpr_2018/html/Hu_Squeeze-and-Excitation_Networks_CVPR_2018_paper.html
+ Title: "Squeeze-and-Excitation Networks"
+ README: configs/seresnet/README.md
+ Code:
+ URL: https://github.com/open-mmlab/mmpretrain/blob/v0.15.0/mmcls/models/backbones/seresnet.py#L58
+ Version: v0.15.0
+
+Models:
+ - Name: seresnet50_8xb32_in1k
+ Metadata:
+ FLOPs: 4130000000
+ Parameters: 28090000
+ In Collection: SEResNet
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 77.74
+ Top 5 Accuracy: 93.84
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/se-resnet/se-resnet50_batch256_imagenet_20200804-ae206104.pth
+ Config: configs/seresnet/seresnet50_8xb32_in1k.py
+ - Name: seresnet101_8xb32_in1k
+ Metadata:
+ FLOPs: 7860000000
+ Parameters: 49330000
+ In Collection: SEResNet
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 78.26
+ Top 5 Accuracy: 94.07
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/se-resnet/se-resnet101_batch256_imagenet_20200804-ba5b51d4.pth
+ Config: configs/seresnet/seresnet101_8xb32_in1k.py
diff --git a/configs/seresnet/seresnet101_8xb32_in1k.py b/configs/seresnet/seresnet101_8xb32_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..8be39e7a32aa38a5c7d0b355c39a28ddff087cf1
--- /dev/null
+++ b/configs/seresnet/seresnet101_8xb32_in1k.py
@@ -0,0 +1,5 @@
+_base_ = [
+ '../_base_/models/seresnet101.py',
+ '../_base_/datasets/imagenet_bs32_pil_resize.py',
+ '../_base_/schedules/imagenet_bs256.py', '../_base_/default_runtime.py'
+]
diff --git a/configs/seresnet/seresnet50_8xb32_in1k.py b/configs/seresnet/seresnet50_8xb32_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..19082bd0dd6bde367a064900f5c51d730bea2923
--- /dev/null
+++ b/configs/seresnet/seresnet50_8xb32_in1k.py
@@ -0,0 +1,6 @@
+_base_ = [
+ '../_base_/models/seresnet50.py',
+ '../_base_/datasets/imagenet_bs32_pil_resize.py',
+ '../_base_/schedules/imagenet_bs256_140e.py',
+ '../_base_/default_runtime.py'
+]
diff --git a/configs/seresnet/seresnext101-32x4d_8xb32_in1k.py b/configs/seresnet/seresnext101-32x4d_8xb32_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..01778305caf8196e73a77f39783ead80a0c3ea56
--- /dev/null
+++ b/configs/seresnet/seresnext101-32x4d_8xb32_in1k.py
@@ -0,0 +1,5 @@
+_base_ = [
+ '../_base_/models/seresnext101_32x4d.py',
+ '../_base_/datasets/imagenet_bs32_pil_resize.py',
+ '../_base_/schedules/imagenet_bs256.py', '../_base_/default_runtime.py'
+]
diff --git a/configs/seresnet/seresnext50-32x4d_8xb32_in1k.py b/configs/seresnet/seresnext50-32x4d_8xb32_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..4d593e45b8992254f97de77fa4d157e9c31ce352
--- /dev/null
+++ b/configs/seresnet/seresnext50-32x4d_8xb32_in1k.py
@@ -0,0 +1,5 @@
+_base_ = [
+ '../_base_/models/seresnext50_32x4d.py',
+ '../_base_/datasets/imagenet_bs32_pil_resize.py',
+ '../_base_/schedules/imagenet_bs256.py', '../_base_/default_runtime.py'
+]
diff --git a/configs/shufflenet_v1/README.md b/configs/shufflenet_v1/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..618a22d775eae984809e4881207c0f645fc1d8c9
--- /dev/null
+++ b/configs/shufflenet_v1/README.md
@@ -0,0 +1,80 @@
+# ShuffleNet V1
+
+> [ShuffleNet: An Extremely Efficient Convolutional Neural Network for Mobile Devices](https://openaccess.thecvf.com/content_cvpr_2018/html/Zhang_ShuffleNet_An_Extremely_CVPR_2018_paper.html)
+
+
+
+## Abstract
+
+We introduce an extremely computation-efficient CNN architecture named ShuffleNet, which is designed specially for mobile devices with very limited computing power (e.g., 10-150 MFLOPs). The new architecture utilizes two new operations, pointwise group convolution and channel shuffle, to greatly reduce computation cost while maintaining accuracy. Experiments on ImageNet classification and MS COCO object detection demonstrate the superior performance of ShuffleNet over other structures, e.g. lower top-1 error (absolute 7.8%) than recent MobileNet on ImageNet classification task, under the computation budget of 40 MFLOPs. On an ARM-based mobile device, ShuffleNet achieves ~13x actual speedup over AlexNet while maintaining comparable accuracy.
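+
+The channel shuffle operation mentioned above is just a grouped reshape and transpose that interleaves channels across groups after a pointwise group convolution. A small illustrative sketch follows (assumed standalone helper, not the operator implementation shipped in this repo):
+
+```python
+import torch
+
+
+def channel_shuffle(x, groups):
+    """Interleave channels across groups so information can flow between
+    the groups produced by a pointwise group convolution."""
+    b, c, h, w = x.shape
+    assert c % groups == 0
+    # (b, groups, c // groups, h, w) -> swap group/channel dims -> flatten back
+    x = x.view(b, groups, c // groups, h, w)
+    x = x.transpose(1, 2).contiguous()
+    return x.view(b, c, h, w)
+
+
+x = torch.arange(8, dtype=torch.float32).view(1, 8, 1, 1)
+print(channel_shuffle(x, groups=2).flatten().tolist())
+# [0.0, 4.0, 1.0, 5.0, 2.0, 6.0, 3.0, 7.0] -- channels interleaved across the 2 groups
+```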
+
+
+

+
+
+## How to use it?
+
+
+
+**Predict image**
+
+```python
+from mmpretrain import inference_model
+
+predict = inference_model('shufflenet-v1-1x_16xb64_in1k', 'demo/bird.JPEG')
+print(predict['pred_class'])
+print(predict['pred_score'])
+```
+
+**Use the model**
+
+```python
+import torch
+from mmpretrain import get_model
+
+model = get_model('shufflenet-v1-1x_16xb64_in1k', pretrained=True)
+inputs = torch.rand(1, 3, 224, 224)
+out = model(inputs)
+print(type(out))
+# To extract features.
+feats = model.extract_feat(inputs)
+print(type(feats))
+```
+
+**Train/Test Command**
+
+Prepare your dataset according to the [docs](https://mmpretrain.readthedocs.io/en/latest/user_guides/dataset_prepare.html#prepare-dataset).
+
+Train:
+
+```shell
+python tools/train.py configs/shufflenet_v1/shufflenet-v1-1x_16xb64_in1k.py
+```
+
+Test:
+
+```shell
+python tools/test.py configs/shufflenet_v1/shufflenet-v1-1x_16xb64_in1k.py https://download.openmmlab.com/mmclassification/v0/shufflenet_v1/shufflenet_v1_batch1024_imagenet_20200804-5d6cec73.pth
+```
+
+
+
+## Models and results
+
+### Image Classification on ImageNet-1k
+
+| Model | Pretrain | Params (M) | Flops (G) | Top-1 (%) | Top-5 (%) | Config | Download |
+| :----------------------------- | :----------: | :--------: | :-------: | :-------: | :-------: | :---------------------------------------: | :------------------------------------------------------------------------------: |
+| `shufflenet-v1-1x_16xb64_in1k` | From scratch | 1.87 | 0.15 | 68.13 | 87.81 | [config](shufflenet-v1-1x_16xb64_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/shufflenet_v1/shufflenet_v1_batch1024_imagenet_20200804-5d6cec73.pth) \| [log](https://download.openmmlab.com/mmclassification/v0/shufflenet_v1/shufflenet_v1_batch1024_imagenet_20200804-5d6cec73.json) |
+
+## Citation
+
+```bibtex
+@inproceedings{zhang2018shufflenet,
+ title={Shufflenet: An extremely efficient convolutional neural network for mobile devices},
+ author={Zhang, Xiangyu and Zhou, Xinyu and Lin, Mengxiao and Sun, Jian},
+ booktitle={Proceedings of the IEEE conference on computer vision and pattern recognition},
+ pages={6848--6856},
+ year={2018}
+}
+```
diff --git a/configs/shufflenet_v1/metafile.yml b/configs/shufflenet_v1/metafile.yml
new file mode 100644
index 0000000000000000000000000000000000000000..e3ca1393e629153f81791c4f584ec0ded04839e2
--- /dev/null
+++ b/configs/shufflenet_v1/metafile.yml
@@ -0,0 +1,35 @@
+Collections:
+ - Name: Shufflenet V1
+ Metadata:
+ Training Data: ImageNet-1k
+ Training Techniques:
+ - SGD with Momentum
+ - Weight Decay
+ - No BN decay
+ Training Resources: 8x 1080 GPUs
+ Epochs: 300
+ Batch Size: 1024
+ Architecture:
+ - Shufflenet V1
+ Paper:
+ URL: https://openaccess.thecvf.com/content_cvpr_2018/html/Zhang_ShuffleNet_An_Extremely_CVPR_2018_paper.html
+ Title: "ShuffleNet: An Extremely Efficient Convolutional Neural Network for Mobile Devices"
+ README: configs/shufflenet_v1/README.md
+ Code:
+ URL: https://github.com/open-mmlab/mmpretrain/blob/v0.15.0/mmcls/models/backbones/shufflenet_v1.py#L152
+ Version: v0.15.0
+
+Models:
+ - Name: shufflenet-v1-1x_16xb64_in1k
+ Metadata:
+ FLOPs: 146000000
+ Parameters: 1870000
+ In Collection: Shufflenet V1
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 68.13
+ Top 5 Accuracy: 87.81
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/shufflenet_v1/shufflenet_v1_batch1024_imagenet_20200804-5d6cec73.pth
+ Config: configs/shufflenet_v1/shufflenet-v1-1x_16xb64_in1k.py
diff --git a/configs/shufflenet_v1/shufflenet-v1-1x_16xb64_in1k.py b/configs/shufflenet_v1/shufflenet-v1-1x_16xb64_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..58e45f1ba419f285d750d4487e40a3dbc803d8e1
--- /dev/null
+++ b/configs/shufflenet_v1/shufflenet-v1-1x_16xb64_in1k.py
@@ -0,0 +1,6 @@
+_base_ = [
+ '../_base_/models/shufflenet_v1_1x.py',
+ '../_base_/datasets/imagenet_bs64_pil_resize.py',
+ '../_base_/schedules/imagenet_bs1024_linearlr_bn_nowd.py',
+ '../_base_/default_runtime.py'
+]
diff --git a/configs/shufflenet_v2/README.md b/configs/shufflenet_v2/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..804aac18087ad8d1cf49c4b7c10ab36eb8128ade
--- /dev/null
+++ b/configs/shufflenet_v2/README.md
@@ -0,0 +1,80 @@
+# Shufflenet V2
+
+> [ShuffleNet V2: Practical Guidelines for Efficient CNN Architecture Design](https://openaccess.thecvf.com/content_ECCV_2018/papers/Ningning_Light-weight_CNN_Architecture_ECCV_2018_paper.pdf)
+
+
+
+## Abstract
+
+Currently, the neural network architecture design is mostly guided by the *indirect* metric of computation complexity, i.e., FLOPs. However, the *direct* metric, e.g., speed, also depends on other factors such as memory access cost and platform characteristics. Thus, this work proposes to evaluate the direct metric on the target platform, beyond only considering FLOPs. Based on a series of controlled experiments, this work derives several practical *guidelines* for efficient network design. Accordingly, a new architecture is presented, called *ShuffleNet V2*. Comprehensive ablation experiments verify that our model is the state-of-the-art in terms of speed and accuracy tradeoff.
+
+
+

+
+
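+Since the paper argues for judging efficiency by the direct metric (speed on the target platform) rather than FLOPs alone, it can be instructive to time the model yourself. The rough sketch below uses the `get_model` API demonstrated in the next section; the warm-up and iteration counts are arbitrary and CPU timings are only indicative.
+
+```python
+import time
+
+import torch
+from mmpretrain import get_model
+
+model = get_model('shufflenet-v2-1x_16xb64_in1k', pretrained=False).eval()
+inputs = torch.rand(1, 3, 224, 224)
+
+with torch.no_grad():
+    # warm up, then measure the average forward latency
+    for _ in range(5):
+        model(inputs)
+    start = time.perf_counter()
+    for _ in range(20):
+        model(inputs)
+print(f'average latency: {(time.perf_counter() - start) / 20 * 1000:.1f} ms')
+```
+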
+## How to use it?
+
+
+
+**Predict image**
+
+```python
+from mmpretrain import inference_model
+
+predict = inference_model('shufflenet-v2-1x_16xb64_in1k', 'demo/bird.JPEG')
+print(predict['pred_class'])
+print(predict['pred_score'])
+```
+
+**Use the model**
+
+```python
+import torch
+from mmpretrain import get_model
+
+model = get_model('shufflenet-v2-1x_16xb64_in1k', pretrained=True)
+inputs = torch.rand(1, 3, 224, 224)
+out = model(inputs)
+print(type(out))
+# To extract features.
+feats = model.extract_feat(inputs)
+print(type(feats))
+```
+
+**Train/Test Command**
+
+Prepare your dataset according to the [docs](https://mmpretrain.readthedocs.io/en/latest/user_guides/dataset_prepare.html#prepare-dataset).
+
+Train:
+
+```shell
+python tools/train.py configs/shufflenet_v2/shufflenet-v2-1x_16xb64_in1k.py
+```
+
+Test:
+
+```shell
+python tools/test.py configs/shufflenet_v2/shufflenet-v2-1x_16xb64_in1k.py https://download.openmmlab.com/mmclassification/v0/shufflenet_v2/shufflenet_v2_batch1024_imagenet_20200812-5bf4721e.pth
+```
+
+
+
+## Models and results
+
+### Image Classification on ImageNet-1k
+
+| Model | Pretrain | Params (M) | Flops (G) | Top-1 (%) | Top-5 (%) | Config | Download |
+| :----------------------------- | :----------: | :--------: | :-------: | :-------: | :-------: | :---------------------------------------: | :------------------------------------------------------------------------------: |
+| `shufflenet-v2-1x_16xb64_in1k` | From scratch | 2.28 | 0.15 | 69.55 | 88.92 | [config](shufflenet-v2-1x_16xb64_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/shufflenet_v2/shufflenet_v2_batch1024_imagenet_20200812-5bf4721e.pth) \| [log](https://download.openmmlab.com/mmclassification/v0/shufflenet_v2/shufflenet_v2_batch1024_imagenet_20200812-5bf4721e.json) |
+
+## Citation
+
+```bibtex
+@inproceedings{ma2018shufflenet,
+ title={Shufflenet v2: Practical guidelines for efficient cnn architecture design},
+ author={Ma, Ningning and Zhang, Xiangyu and Zheng, Hai-Tao and Sun, Jian},
+ booktitle={Proceedings of the European conference on computer vision (ECCV)},
+ pages={116--131},
+ year={2018}
+}
+```
diff --git a/configs/shufflenet_v2/metafile.yml b/configs/shufflenet_v2/metafile.yml
new file mode 100644
index 0000000000000000000000000000000000000000..9c1eebc5e9fdb66523f719bdae1bdd38a58fea84
--- /dev/null
+++ b/configs/shufflenet_v2/metafile.yml
@@ -0,0 +1,35 @@
+Collections:
+ - Name: Shufflenet V2
+ Metadata:
+ Training Data: ImageNet-1k
+ Training Techniques:
+ - SGD with Momentum
+ - Weight Decay
+ - No BN decay
+ Training Resources: 8x 1080 GPUs
+ Epochs: 300
+ Batch Size: 1024
+ Architecture:
+ - Shufflenet V2
+ Paper:
+ URL: https://openaccess.thecvf.com/content_ECCV_2018/papers/Ningning_Light-weight_CNN_Architecture_ECCV_2018_paper.pdf
+ Title: "ShuffleNet V2: Practical Guidelines for Efficient CNN Architecture Design"
+ README: configs/shufflenet_v2/README.md
+ Code:
+ URL: https://github.com/open-mmlab/mmpretrain/blob/v0.15.0/mmcls/models/backbones/shufflenet_v2.py#L134
+ Version: v0.15.0
+
+Models:
+ - Name: shufflenet-v2-1x_16xb64_in1k
+ Metadata:
+ FLOPs: 149000000
+ Parameters: 2280000
+ In Collection: Shufflenet V2
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 69.55
+ Top 5 Accuracy: 88.92
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/shufflenet_v2/shufflenet_v2_batch1024_imagenet_20200812-5bf4721e.pth
+ Config: configs/shufflenet_v2/shufflenet-v2-1x_16xb64_in1k.py
diff --git a/configs/shufflenet_v2/shufflenet-v2-1x_16xb64_in1k.py b/configs/shufflenet_v2/shufflenet-v2-1x_16xb64_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..a106ab8686c985a66b1c9b6af3407ef48a40c64e
--- /dev/null
+++ b/configs/shufflenet_v2/shufflenet-v2-1x_16xb64_in1k.py
@@ -0,0 +1,6 @@
+_base_ = [
+ '../_base_/models/shufflenet_v2_1x.py',
+ '../_base_/datasets/imagenet_bs64_pil_resize.py',
+ '../_base_/schedules/imagenet_bs1024_linearlr_bn_nowd.py',
+ '../_base_/default_runtime.py'
+]
diff --git a/configs/simclr/README.md b/configs/simclr/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..17d0de2b79499ec47cdcb4e5eff59d362b77fced
--- /dev/null
+++ b/configs/simclr/README.md
@@ -0,0 +1,87 @@
+# SimCLR
+
+> [A simple framework for contrastive learning of visual representations](https://arxiv.org/abs/2002.05709)
+
+
+
+## Abstract
+
+This paper presents SimCLR: a simple framework for contrastive learning of visual representations. We simplify recently proposed contrastive self-supervised learning algorithms without requiring specialized architectures or a memory bank. In order to understand what enables the contrastive prediction tasks to learn useful representations, we systematically study the major components of our framework. We show that (1) composition of data augmentations plays a critical role in defining effective predictive tasks, (2) introducing a learnable nonlinear transformation between the representation and the contrastive loss substantially improves the quality of the learned representations, and (3) contrastive learning benefits from larger batch sizes and more training steps compared to supervised learning. By combining these findings, we are able to considerably outperform previous methods for self-supervised and semi-supervised learning on ImageNet. A linear classifier trained on self-supervised representations learned by SimCLR achieves 76.5% top-1 accuracy, which is a 7% relative improvement over previous state-of-the-art, matching the performance of a supervised ResNet-50.
+
+
+

+
+
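+The objective behind SimCLR is the NT-Xent (normalized temperature-scaled cross entropy) loss computed between the projected embeddings of two augmented views of each image. The snippet below is a compact reference sketch of that loss in plain PyTorch, separate from the `ContrastiveHead` used by the configs in this folder; the temperature of 0.1 matches the configs.
+
+```python
+import torch
+import torch.nn.functional as F
+
+
+def nt_xent_loss(z1: torch.Tensor, z2: torch.Tensor, temperature: float = 0.1) -> torch.Tensor:
+    """NT-Xent loss for two batches of projections (N, D) from two views."""
+    n = z1.size(0)
+    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)  # (2N, D)
+    sim = z @ z.t() / temperature                       # (2N, 2N)
+    sim.fill_diagonal_(float('-inf'))                   # drop self-similarity
+    # the positive for sample i is its other view, located N rows away
+    targets = torch.cat([torch.arange(n) + n, torch.arange(n)])
+    return F.cross_entropy(sim, targets)
+
+
+print(nt_xent_loss(torch.randn(8, 128), torch.randn(8, 128)).item())
+```
+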
+## How to use it?
+
+
+
+**Predict image**
+
+```python
+from mmpretrain import inference_model
+
+predict = inference_model('resnet50_simclr-200e-pre_8xb512-linear-coslr-90e_in1k', 'demo/bird.JPEG')
+print(predict['pred_class'])
+print(predict['pred_score'])
+```
+
+**Use the model**
+
+```python
+import torch
+from mmpretrain import get_model
+
+model = get_model('simclr_resnet50_16xb256-coslr-200e_in1k', pretrained=True)
+inputs = torch.rand(1, 3, 224, 224)
+out = model(inputs)
+print(type(out))
+# To extract features.
+feats = model.extract_feat(inputs)
+print(type(feats))
+```
+
+**Train/Test Command**
+
+Prepare your dataset according to the [docs](https://mmpretrain.readthedocs.io/en/latest/user_guides/dataset_prepare.html#prepare-dataset).
+
+Train:
+
+```shell
+python tools/train.py configs/simclr/simclr_resnet50_16xb256-coslr-200e_in1k.py
+```
+
+Test:
+
+```shell
+python tools/test.py configs/simclr/benchmarks/resnet50_8xb512-linear-coslr-90e_in1k.py https://download.openmmlab.com/mmselfsup/1.x/simclr/simclr_resnet50_16xb256-coslr-200e_in1k/resnet50_linear-8xb512-coslr-90e_in1k/resnet50_linear-8xb512-coslr-90e_in1k_20220825-f12c0457.pth
+```
+
+
+
+## Models and results
+
+### Pretrained models
+
+| Model | Params (M) | Flops (G) | Config | Download |
+| :---------------------------------------- | :--------: | :-------: | :--------------------------------------------------: | :--------------------------------------------------------------------------------------: |
+| `simclr_resnet50_16xb256-coslr-200e_in1k` | 27.97 | 4.11 | [config](simclr_resnet50_16xb256-coslr-200e_in1k.py) | [model](https://download.openmmlab.com/mmselfsup/1.x/simclr/simclr_resnet50_16xb256-coslr-200e_in1k/simclr_resnet50_16xb256-coslr-200e_in1k_20220825-4d9cce50.pth) \| [log](https://download.openmmlab.com/mmselfsup/1.x/simclr/simclr_resnet50_16xb256-coslr-200e_in1k/simclr_resnet50_16xb256-coslr-200e_in1k_20220825-4d9cce50.json) |
+| `simclr_resnet50_16xb256-coslr-800e_in1k` | 27.97 | 4.11 | [config](simclr_resnet50_16xb256-coslr-800e_in1k.py) | [model](https://download.openmmlab.com/mmselfsup/1.x/simclr/simclr_resnet50_16xb256-coslr-800e_in1k/simclr_resnet50_16xb256-coslr-800e_in1k_20220825-85fcc4de.pth) \| [log](https://download.openmmlab.com/mmselfsup/1.x/simclr/simclr_resnet50_16xb256-coslr-800e_in1k/simclr_resnet50_16xb256-coslr-800e_in1k_20220825-85fcc4de.json) |
+
+### Image Classification on ImageNet-1k
+
+| Model | Pretrain | Params (M) | Flops (G) | Top-1 (%) | Config | Download |
+| :---------------------------------------- | :------------------------------------------: | :--------: | :-------: | :-------: | :----------------------------------------: | :-------------------------------------------: |
+| `resnet50_simclr-200e-pre_8xb512-linear-coslr-90e_in1k` | [SIMCLR 200-Epochs](https://download.openmmlab.com/mmselfsup/1.x/simclr/simclr_resnet50_16xb256-coslr-200e_in1k/simclr_resnet50_16xb256-coslr-200e_in1k_20220825-4d9cce50.pth) | 25.56 | 4.11 | 66.90 | [config](benchmarks/resnet50_8xb512-linear-coslr-90e_in1k.py) | [model](https://download.openmmlab.com/mmselfsup/1.x/simclr/simclr_resnet50_16xb256-coslr-200e_in1k/resnet50_linear-8xb512-coslr-90e_in1k/resnet50_linear-8xb512-coslr-90e_in1k_20220825-f12c0457.pth) \| [log](https://download.openmmlab.com/mmselfsup/1.x/simclr/simclr_resnet50_16xb256-coslr-200e_in1k/resnet50_linear-8xb512-coslr-90e_in1k/resnet50_linear-8xb512-coslr-90e_in1k_20220825-f12c0457.json) |
+| `resnet50_simclr-800e-pre_8xb512-linear-coslr-90e_in1k` | [SIMCLR 800-Epochs](https://download.openmmlab.com/mmselfsup/1.x/simclr/simclr_resnet50_16xb256-coslr-800e_in1k/simclr_resnet50_16xb256-coslr-800e_in1k_20220825-85fcc4de.pth) | 25.56 | 4.11 | 69.20 | [config](benchmarks/resnet50_8xb512-linear-coslr-90e_in1k.py) | [model](https://download.openmmlab.com/mmselfsup/1.x/simclr/simclr_resnet50_16xb256-coslr-800e_in1k/resnet50_linear-8xb512-coslr-90e_in1k/resnet50_linear-8xb512-coslr-90e_in1k_20220825-b80ae1e5.pth) \| [log](https://download.openmmlab.com/mmselfsup/1.x/simclr/simclr_resnet50_16xb256-coslr-800e_in1k/resnet50_linear-8xb512-coslr-90e_in1k/resnet50_linear-8xb512-coslr-90e_in1k_20220825-b80ae1e5.json) |
+
+## Citation
+
+```bibtex
+@inproceedings{chen2020simple,
+ title={A simple framework for contrastive learning of visual representations},
+ author={Chen, Ting and Kornblith, Simon and Norouzi, Mohammad and Hinton, Geoffrey},
+ booktitle={ICML},
+ year={2020},
+}
+```
diff --git a/configs/simclr/benchmarks/resnet50_8xb512-linear-coslr-90e_in1k.py b/configs/simclr/benchmarks/resnet50_8xb512-linear-coslr-90e_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..2b5074c082b8b6fb36bd3c6711b60bab6394b4ce
--- /dev/null
+++ b/configs/simclr/benchmarks/resnet50_8xb512-linear-coslr-90e_in1k.py
@@ -0,0 +1,18 @@
+_base_ = [
+ '../../_base_/models/resnet50.py',
+ '../../_base_/datasets/imagenet_bs32_pil_resize.py',
+ '../../_base_/schedules/imagenet_lars_coslr_90e.py',
+ '../../_base_/default_runtime.py',
+]
+
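+# Linear evaluation protocol: the ResNet stages are frozen and initialized from a
+# self-supervised checkpoint (set the `checkpoint` path before training), so only
+# the classification head is optimized.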
+model = dict(
+ backbone=dict(
+ frozen_stages=4,
+ init_cfg=dict(type='Pretrained', checkpoint='', prefix='backbone.')))
+
+# dataset summary
+train_dataloader = dict(batch_size=512)
+
+# runtime settings
+default_hooks = dict(
+ checkpoint=dict(type='CheckpointHook', interval=10, max_keep_ckpts=3))
diff --git a/configs/simclr/metafile.yml b/configs/simclr/metafile.yml
new file mode 100644
index 0000000000000000000000000000000000000000..23c31ed3533160739f66731b9c02f6547910dd44
--- /dev/null
+++ b/configs/simclr/metafile.yml
@@ -0,0 +1,72 @@
+Collections:
+ - Name: SimCLR
+ Metadata:
+ Training Data: ImageNet-1k
+ Training Techniques:
+ - LARS
+ Training Resources: 8x V100 GPUs (b256), 16x A100-80G GPUs (b4096)
+ Architecture:
+ - ResNet
+ - SimCLR
+ Paper:
+ Title: A simple framework for contrastive learning of visual representations
+ URL: https://arxiv.org/abs/2002.05709
+ README: configs/simclr/README.md
+
+Models:
+ - Name: simclr_resnet50_16xb256-coslr-200e_in1k
+ Metadata:
+ Epochs: 200
+ Batch Size: 4096
+ FLOPs: 4109364224
+ Parameters: 27968832
+ Training Data: ImageNet-1k
+ In Collection: SimCLR
+ Results: null
+ Weights: https://download.openmmlab.com/mmselfsup/1.x/simclr/simclr_resnet50_16xb256-coslr-200e_in1k/simclr_resnet50_16xb256-coslr-200e_in1k_20220825-4d9cce50.pth
+ Config: configs/simclr/simclr_resnet50_16xb256-coslr-200e_in1k.py
+ Downstream:
+ - resnet50_simclr-200e-pre_8xb512-linear-coslr-90e_in1k
+ - Name: simclr_resnet50_16xb256-coslr-800e_in1k
+ Metadata:
+ Epochs: 800
+ Batch Size: 4096
+ FLOPs: 4109364224
+ Parameters: 27968832
+ Training Data: ImageNet-1k
+ In Collection: SimCLR
+ Results: null
+ Weights: https://download.openmmlab.com/mmselfsup/1.x/simclr/simclr_resnet50_16xb256-coslr-800e_in1k/simclr_resnet50_16xb256-coslr-800e_in1k_20220825-85fcc4de.pth
+ Config: configs/simclr/simclr_resnet50_16xb256-coslr-800e_in1k.py
+ Downstream:
+ - resnet50_simclr-800e-pre_8xb512-linear-coslr-90e_in1k
+ - Name: resnet50_simclr-200e-pre_8xb512-linear-coslr-90e_in1k
+ Metadata:
+ Epochs: 90
+ Batch Size: 4096
+ FLOPs: 4109464576
+ Parameters: 25557032
+ Training Data: ImageNet-1k
+ In Collection: SimCLR
+ Results:
+ - Task: Image Classification
+ Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 66.9
+ Weights: https://download.openmmlab.com/mmselfsup/1.x/simclr/simclr_resnet50_16xb256-coslr-200e_in1k/resnet50_linear-8xb512-coslr-90e_in1k/resnet50_linear-8xb512-coslr-90e_in1k_20220825-f12c0457.pth
+ Config: configs/simclr/benchmarks/resnet50_8xb512-linear-coslr-90e_in1k.py
+ - Name: resnet50_simclr-800e-pre_8xb512-linear-coslr-90e_in1k
+ Metadata:
+ Epochs: 90
+ Batch Size: 4096
+ FLOPs: 4109464576
+ Parameters: 25557032
+ Training Data: ImageNet-1k
+ In Collection: SimCLR
+ Results:
+ - Task: Image Classification
+ Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 69.2
+ Weights: https://download.openmmlab.com/mmselfsup/1.x/simclr/simclr_resnet50_16xb256-coslr-800e_in1k/resnet50_linear-8xb512-coslr-90e_in1k/resnet50_linear-8xb512-coslr-90e_in1k_20220825-b80ae1e5.pth
+ Config: configs/simclr/benchmarks/resnet50_8xb512-linear-coslr-90e_in1k.py
diff --git a/configs/simclr/simclr_resnet50_16xb256-coslr-200e_in1k.py b/configs/simclr/simclr_resnet50_16xb256-coslr-200e_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..b48d5b31071dbb5622616b62835caa6cdd8d9589
--- /dev/null
+++ b/configs/simclr/simclr_resnet50_16xb256-coslr-200e_in1k.py
@@ -0,0 +1,46 @@
+_base_ = [
+ '../_base_/datasets/imagenet_bs32_simclr.py',
+ '../_base_/schedules/imagenet_lars_coslr_200e.py',
+ '../_base_/default_runtime.py',
+]
+
+# dataset settings
+train_dataloader = dict(batch_size=256)
+
+# model settings
+model = dict(
+ type='SimCLR',
+ backbone=dict(
+ type='ResNet',
+ depth=50,
+ norm_cfg=dict(type='SyncBN'),
+ zero_init_residual=True),
+ neck=dict(
+ type='NonLinearNeck', # SimCLR non-linear neck
+ in_channels=2048,
+ hid_channels=2048,
+ out_channels=128,
+ num_layers=2,
+ with_avg_pool=True),
+ head=dict(
+ type='ContrastiveHead',
+ loss=dict(type='CrossEntropyLoss'),
+ temperature=0.1),
+)
+
+# optimizer
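+# lr=4.8 is the base lr of 0.3 scaled linearly to the total batch size
+# (0.3 * 4096 / 256); BN and bias parameters are excluded from LARS adaptation.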
+optim_wrapper = dict(
+ type='OptimWrapper',
+ optimizer=dict(type='LARS', lr=4.8, momentum=0.9, weight_decay=1e-6),
+ paramwise_cfg=dict(
+ custom_keys={
+ 'bn': dict(decay_mult=0, lars_exclude=True),
+ 'bias': dict(decay_mult=0, lars_exclude=True),
+ # bn layer in ResNet block downsample module
+ 'downsample.1': dict(decay_mult=0, lars_exclude=True),
+ }))
+
+# runtime settings
+default_hooks = dict(
+ # only keeps the latest 3 checkpoints
+ checkpoint=dict(type='CheckpointHook', interval=10, max_keep_ckpts=3))
diff --git a/configs/simclr/simclr_resnet50_16xb256-coslr-800e_in1k.py b/configs/simclr/simclr_resnet50_16xb256-coslr-800e_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..478ef0c33418a9467d01c2a0c133be119318359c
--- /dev/null
+++ b/configs/simclr/simclr_resnet50_16xb256-coslr-800e_in1k.py
@@ -0,0 +1,57 @@
+_base_ = [
+ '../_base_/datasets/imagenet_bs32_simclr.py',
+ '../_base_/schedules/imagenet_lars_coslr_200e.py',
+ '../_base_/default_runtime.py',
+]
+
+# dataset settings
+train_dataloader = dict(batch_size=256)
+
+# model settings
+model = dict(
+ type='SimCLR',
+ backbone=dict(
+ type='ResNet',
+ depth=50,
+ norm_cfg=dict(type='SyncBN'),
+ zero_init_residual=True),
+ neck=dict(
+ type='NonLinearNeck', # SimCLR non-linear neck
+ in_channels=2048,
+ hid_channels=2048,
+ out_channels=128,
+ num_layers=2,
+ with_avg_pool=True),
+ head=dict(
+ type='ContrastiveHead',
+ loss=dict(type='CrossEntropyLoss'),
+ temperature=0.1),
+)
+
+# optimizer
+optim_wrapper = dict(
+ type='OptimWrapper',
+ optimizer=dict(type='LARS', lr=4.8, momentum=0.9, weight_decay=1e-6),
+ paramwise_cfg=dict(
+ custom_keys={
+ 'bn': dict(decay_mult=0, lars_exclude=True),
+ 'bias': dict(decay_mult=0, lars_exclude=True),
+ # bn layer in ResNet block downsample module
+ 'downsample.1': dict(decay_mult=0, lars_exclude=True),
+ }))
+
+# learning rate scheduler
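+# 10 epochs of linear warmup, then cosine decay over the remaining 790 epochs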
+param_scheduler = [
+ dict(
+ type='LinearLR',
+ start_factor=1e-4,
+ by_epoch=True,
+ begin=0,
+ end=10,
+ convert_to_iter_based=True),
+ dict(
+ type='CosineAnnealingLR', T_max=790, by_epoch=True, begin=10, end=800)
+]
+
+# runtime settings
+train_cfg = dict(max_epochs=800)
+default_hooks = dict(
+ # only keeps the latest 3 checkpoints
+ checkpoint=dict(type='CheckpointHook', interval=10, max_keep_ckpts=3))
diff --git a/configs/simclr/simclr_resnet50_8xb32-coslr-200e_in1k.py b/configs/simclr/simclr_resnet50_8xb32-coslr-200e_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..36a144536e832c5e022675f3f6878d1cfa71c563
--- /dev/null
+++ b/configs/simclr/simclr_resnet50_8xb32-coslr-200e_in1k.py
@@ -0,0 +1,47 @@
+_base_ = [
+ '../_base_/datasets/imagenet_bs32_simclr.py',
+ '../_base_/schedules/imagenet_lars_coslr_200e.py',
+ '../_base_/default_runtime.py',
+]
+
+# model settings
+model = dict(
+ type='SimCLR',
+ backbone=dict(
+ type='ResNet',
+ depth=50,
+ norm_cfg=dict(type='SyncBN'),
+ zero_init_residual=True),
+ neck=dict(
+ type='NonLinearNeck', # SimCLR non-linear neck
+ in_channels=2048,
+ hid_channels=2048,
+ out_channels=128,
+ num_layers=2,
+ with_avg_pool=True),
+ head=dict(
+ type='ContrastiveHead',
+ loss=dict(type='CrossEntropyLoss'),
+ temperature=0.1),
+)
+
+# optimizer
+optim_wrapper = dict(
+ type='OptimWrapper',
+ optimizer=dict(type='LARS', lr=0.3, momentum=0.9, weight_decay=1e-6),
+ paramwise_cfg=dict(
+ custom_keys={
+ 'bn': dict(decay_mult=0, lars_exclude=True),
+ 'bias': dict(decay_mult=0, lars_exclude=True),
+ # bn layer in ResNet block downsample module
+ 'downsample.1': dict(decay_mult=0, lars_exclude=True),
+ }))
+
+# runtime settings
+default_hooks = dict(
+ # only keeps the latest 3 checkpoints
+ checkpoint=dict(type='CheckpointHook', interval=10, max_keep_ckpts=3))
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR
+# based on the actual training batch size.
+auto_scale_lr = dict(base_batch_size=256)
diff --git a/configs/simmim/README.md b/configs/simmim/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..3e44b0790086ac62c5719eba3198fd531f2dab98
--- /dev/null
+++ b/configs/simmim/README.md
@@ -0,0 +1,90 @@
+# SimMIM
+
+> [SimMIM: A Simple Framework for Masked Image Modeling](https://arxiv.org/abs/2111.09886)
+
+
+
+## Abstract
+
+This paper presents SimMIM, a simple framework for masked image modeling. We simplify recently proposed related approaches without special designs such as blockwise masking and tokenization via discrete VAE or clustering. To study what lets the masked image modeling task learn good representations, we systematically study the major components in our framework, and find that simple designs of each component have revealed very strong representation learning performance: 1) random masking of the input image with a moderately large masked patch size (e.g., 32) makes a strong pre-text task; 2) predicting raw pixels of RGB values by direct regression performs no worse than the patch classification approaches with complex designs; 3) the prediction head can be as light as a linear layer, with no worse performance than heavier ones. Using ViT-B, our approach achieves 83.8% top-1 fine-tuning accuracy on ImageNet-1K by pre-training also on this dataset, surpassing the previous best approach by +0.6%. When applied to a larger model of about 650 million parameters, SwinV2-H, it achieves 87.1% top-1 accuracy on ImageNet-1K using only ImageNet-1K data. We also leverage this approach to facilitate the training of a 3B model (SwinV2-G) that, using 40× less data than in previous practice, achieves the state-of-the-art on four representative vision benchmarks. The code and models will be publicly available at https://github.com/microsoft/SimMIM.
+
+
+

+
+
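+The pretext task is deliberately simple: mask random patches of the input and regress the raw RGB pixels of the masked region with an L1 loss. The sketch below replays that recipe on raw tensors; the patch size (32) and mask ratio (0.6) follow the paper, while the prediction is faked with random values since the real one comes from the Swin encoder and the linear decoder.
+
+```python
+import torch
+import torch.nn.functional as F
+
+patch_size, mask_ratio = 32, 0.6
+img = torch.rand(1, 3, 192, 192)
+
+# random patch-level mask (1 = masked), upsampled to pixel resolution
+side = 192 // patch_size
+mask = (torch.rand(1, 1, side, side) < mask_ratio).float()
+mask = F.interpolate(mask, scale_factor=patch_size, mode='nearest')
+
+# a real model predicts the pixels of the masked patches; fake it here
+pred = torch.rand_like(img)
+
+# L1 reconstruction loss averaged over the masked pixels only
+loss = (F.l1_loss(pred, img, reduction='none') * mask).sum() / (mask.sum() * 3 + 1e-8)
+print(loss.item())
+```
+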
+## How to use it?
+
+
+
+**Predict image**
+
+```python
+from mmpretrain import inference_model
+
+predict = inference_model('swin-base-w6_simmim-100e-pre_8xb256-coslr-100e_in1k-192px', 'demo/bird.JPEG')
+print(predict['pred_class'])
+print(predict['pred_score'])
+```
+
+**Use the model**
+
+```python
+import torch
+from mmpretrain import get_model
+
+model = get_model('simmim_swin-base-w6_8xb256-amp-coslr-100e_in1k-192px', pretrained=True)
+inputs = torch.rand(1, 3, 192, 192)
+out = model(inputs)
+print(type(out))
+# To extract features.
+feats = model.extract_feat(inputs)
+print(type(feats))
+```
+
+**Train/Test Command**
+
+Prepare your dataset according to the [docs](https://mmpretrain.readthedocs.io/en/latest/user_guides/dataset_prepare.html#prepare-dataset).
+
+Train:
+
+```shell
+python tools/train.py configs/simmim/simmim_swin-base-w6_8xb256-amp-coslr-100e_in1k-192px.py
+```
+
+Test:
+
+```shell
+python tools/test.py configs/simmim/benchmarks/swin-base-w6_8xb256-coslr-100e_in1k-192px.py https://download.openmmlab.com/mmselfsup/1.x/simmim/simmim_swin-base_8xb256-amp-coslr-100e_in1k-192/swin-base_ft-8xb256-coslr-100e_in1k/swin-base_ft-8xb256-coslr-100e_in1k_20220829-9cf23aa1.pth
+```
+
+
+
+## Models and results
+
+### Pretrained models
+
+| Model | Params (M) | Flops (G) | Config | Download |
+| :-------------------------------------------------------- | :--------: | :-------: | :-----------------------------------------------------------: | :-------------------------------------------------------------: |
+| `simmim_swin-base-w6_8xb256-amp-coslr-100e_in1k-192px` | 89.87 | 18.83 | [config](simmim_swin-base-w6_8xb256-amp-coslr-100e_in1k-192px.py) | [model](https://download.openmmlab.com/mmselfsup/1.x/simmim/simmim_swin-base_8xb256-amp-coslr-100e_in1k-192/simmim_swin-base_8xb256-amp-coslr-100e_in1k-192_20220829-0e15782d.pth) \| [log](https://download.openmmlab.com/mmselfsup/1.x/simmim/simmim_swin-base_8xb256-amp-coslr-100e_in1k-192/simmim_swin-base_8xb256-amp-coslr-100e_in1k-192_20220829-0e15782d.json) |
+| `simmim_swin-base-w6_16xb128-amp-coslr-800e_in1k-192px` | 89.87 | 18.83 | [config](simmim_swin-base-w6_16xb128-amp-coslr-800e_in1k-192px.py) | [model](https://download.openmmlab.com/mmselfsup/1.x/simmim/simmim_swin-base_16xb128-amp-coslr-800e_in1k-192/simmim_swin-base_16xb128-amp-coslr-800e_in1k-192_20220916-a0e931ac.pth) \| [log](https://download.openmmlab.com/mmselfsup/1.x/simmim/simmim_swin-base_16xb128-amp-coslr-800e_in1k-192/simmim_swin-base_16xb128-amp-coslr-800e_in1k-192_20220916-a0e931ac.json) |
+| `simmim_swin-large-w12_16xb128-amp-coslr-800e_in1k-192px` | 199.92 | 55.85 | [config](simmim_swin-large-w12_16xb128-amp-coslr-800e_in1k-192px.py) | [model](https://download.openmmlab.com/mmselfsup/1.x/simmim/simmim_swin-large_16xb128-amp-coslr-800e_in1k-192/simmim_swin-large_16xb128-amp-coslr-800e_in1k-192_20220916-4ad216d3.pth) \| [log](https://download.openmmlab.com/mmselfsup/1.x/simmim/simmim_swin-large_16xb128-amp-coslr-800e_in1k-192/simmim_swin-large_16xb128-amp-coslr-800e_in1k-192_20220916-4ad216d3.json) |
+
+### Image Classification on ImageNet-1k
+
+| Model | Pretrain | Params (M) | Flops (G) | Top-1 (%) | Config | Download |
+| :---------------------------------------- | :------------------------------------------: | :--------: | :-------: | :-------: | :----------------------------------------: | :-------------------------------------------: |
+| `swin-base-w6_simmim-100e-pre_8xb256-coslr-100e_in1k-192px` | [SIMMIM 100-Epochs](https://download.openmmlab.com/mmselfsup/1.x/simmim/simmim_swin-base_8xb256-amp-coslr-100e_in1k-192/simmim_swin-base_8xb256-amp-coslr-100e_in1k-192_20220829-0e15782d.pth) | 87.75 | 11.30 | 82.70 | [config](benchmarks/swin-base-w6_8xb256-coslr-100e_in1k-192px.py) | [model](https://download.openmmlab.com/mmselfsup/1.x/simmim/simmim_swin-base_8xb256-amp-coslr-100e_in1k-192/swin-base_ft-8xb256-coslr-100e_in1k/swin-base_ft-8xb256-coslr-100e_in1k_20220829-9cf23aa1.pth) \| [log](https://download.openmmlab.com/mmselfsup/1.x/simmim/simmim_swin-base_8xb256-amp-coslr-100e_in1k-192/swin-base_ft-8xb256-coslr-100e_in1k/swin-base_ft-8xb256-coslr-100e_in1k_20220829-9cf23aa1.json) |
+| `swin-base-w7_simmim-100e-pre_8xb256-coslr-100e_in1k` | [SIMMIM 100-Epochs](https://download.openmmlab.com/mmselfsup/1.x/simmim/simmim_swin-base_8xb256-amp-coslr-100e_in1k-192/simmim_swin-base_8xb256-amp-coslr-100e_in1k-192_20220829-0e15782d.pth) | 87.77 | 15.47 | 83.50 | [config](benchmarks/swin-base-w7_8xb256-coslr-100e_in1k.py) | N/A |
+| `swin-base-w6_simmim-800e-pre_8xb256-coslr-100e_in1k-192px` | [SIMMIM 800-Epochs](https://download.openmmlab.com/mmselfsup/1.x/simmim/simmim_swin-base_16xb128-amp-coslr-800e_in1k-192/simmim_swin-base_16xb128-amp-coslr-800e_in1k-192_20220916-a0e931ac.pth) | 87.77 | 15.47 | 83.80 | [config](benchmarks/swin-base-w7_8xb256-coslr-100e_in1k.py) | [model](https://download.openmmlab.com/mmselfsup/1.x/simmim/simmim_swin-base_16xb128-amp-coslr-800e_in1k-192/swin-base_ft-8xb256-coslr-100e_in1k-224/swin-base_ft-8xb256-coslr-100e_in1k-224_20221208-155cc6e6.pth) \| [log](https://download.openmmlab.com/mmselfsup/1.x/simmim/simmim_swin-base_16xb128-amp-coslr-800e_in1k-192/swin-base_ft-8xb256-coslr-100e_in1k-224/swin-base_ft-8xb256-coslr-100e_in1k-224_20221208-155cc6e6.json) |
+| `swin-large-w14_simmim-800e-pre_8xb256-coslr-100e_in1k` | [SIMMIM 800-Epochs](https://download.openmmlab.com/mmselfsup/1.x/simmim/simmim_swin-large_16xb128-amp-coslr-800e_in1k-192/simmim_swin-large_16xb128-amp-coslr-800e_in1k-192_20220916-4ad216d3.pth) | 196.85 | 38.85 | 84.80 | [config](benchmarks/swin-large-w14_8xb256-coslr-100e_in1k.py) | [model](https://download.openmmlab.com/mmselfsup/1.x/simmim/simmim_swin-large_16xb128-amp-coslr-800e_in1k-192/swin-large_ft-8xb256-coslr-ws14-100e_in1k-224/swin-large_ft-8xb256-coslr-ws14-100e_in1k-224_20220916-d4865790.pth) \| [log](https://download.openmmlab.com/mmselfsup/1.x/simmim/simmim_swin-large_16xb128-amp-coslr-800e_in1k-192/swin-large_ft-8xb256-coslr-ws14-100e_in1k-224/swin-large_ft-8xb256-coslr-ws14-100e_in1k-224_20220916-d4865790.json) |
+
+## Citation
+
+```bibtex
+@inproceedings{xie2021simmim,
+ title={SimMIM: A Simple Framework for Masked Image Modeling},
+ author={Xie, Zhenda and Zhang, Zheng and Cao, Yue and Lin, Yutong and Bao, Jianmin and Yao, Zhuliang and Dai, Qi and Hu, Han},
+ booktitle={International Conference on Computer Vision and Pattern Recognition (CVPR)},
+ year={2022}
+}
+```
diff --git a/configs/simmim/benchmarks/swin-base-w6_8xb256-coslr-100e_in1k-192px.py b/configs/simmim/benchmarks/swin-base-w6_8xb256-coslr-100e_in1k-192px.py
new file mode 100644
index 0000000000000000000000000000000000000000..47c4fa1ccfa42b0d6a3c7eb58f43f8250441b7f3
--- /dev/null
+++ b/configs/simmim/benchmarks/swin-base-w6_8xb256-coslr-100e_in1k-192px.py
@@ -0,0 +1,59 @@
+_base_ = [
+ '../../_base_/models/swin_transformer/base_224.py',
+ '../../_base_/datasets/imagenet_bs256_swin_192.py',
+ '../../_base_/default_runtime.py'
+]
+
+# model settings
+model = dict(
+ backbone=dict(
+ img_size=192,
+ drop_path_rate=0.1,
+ stage_cfgs=dict(block_cfgs=dict(window_size=6)),
+ init_cfg=dict(type='Pretrained', checkpoint='', prefix='backbone.')))
+
+# optimizer settings
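+# layer-wise lr decay: the lr is multiplied by 0.9 per layer going from the
+# classification head towards the patch embedding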
+optim_wrapper = dict(
+ type='AmpOptimWrapper',
+ optimizer=dict(type='AdamW', lr=5e-3, weight_decay=0.05),
+ clip_grad=dict(max_norm=5.0),
+ constructor='LearningRateDecayOptimWrapperConstructor',
+ paramwise_cfg=dict(
+ layer_decay_rate=0.9,
+ custom_keys={
+ '.norm': dict(decay_mult=0.0),
+ '.bias': dict(decay_mult=0.0),
+ '.absolute_pos_embed': dict(decay_mult=0.0),
+ '.relative_position_bias_table': dict(decay_mult=0.0)
+ }))
+
+# learning rate scheduler
+param_scheduler = [
+ dict(
+ type='LinearLR',
+ start_factor=2.5e-7 / 1.25e-3,
+ by_epoch=True,
+ begin=0,
+ end=20,
+ convert_to_iter_based=True),
+ dict(
+ type='CosineAnnealingLR',
+ T_max=80,
+ eta_min=2.5e-7 * 2048 / 512,
+ by_epoch=True,
+ begin=20,
+ end=100,
+ convert_to_iter_based=True)
+]
+
+# runtime settings
+train_cfg = dict(type='EpochBasedTrainLoop', max_epochs=100)
+val_cfg = dict()
+test_cfg = dict()
+
+default_hooks = dict(
+ # save checkpoint per epoch.
+ checkpoint=dict(type='CheckpointHook', interval=1, max_keep_ckpts=3),
+ logger=dict(type='LoggerHook', interval=100))
+
+randomness = dict(seed=0)
diff --git a/configs/simmim/benchmarks/swin-base-w7_8xb256-coslr-100e_in1k.py b/configs/simmim/benchmarks/swin-base-w7_8xb256-coslr-100e_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..f7325f03d6b495b9b775f4e2cc3c33a06f6af7dd
--- /dev/null
+++ b/configs/simmim/benchmarks/swin-base-w7_8xb256-coslr-100e_in1k.py
@@ -0,0 +1,102 @@
+_base_ = [
+ '../../_base_/models/swin_transformer/base_224.py',
+ '../../_base_/datasets/imagenet_bs256_swin_192.py',
+ '../../_base_/default_runtime.py'
+]
+
+# dataset settings
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='RandomResizedCrop',
+ scale=224,
+ backend='pillow',
+ interpolation='bicubic'),
+ dict(type='RandomFlip', prob=0.5, direction='horizontal'),
+ dict(
+ type='RandAugment',
+ policies='timm_increasing',
+ num_policies=2,
+ total_level=10,
+ magnitude_level=9,
+ magnitude_std=0.5,
+ hparams=dict(pad_val=[104, 116, 124], interpolation='bicubic')),
+ dict(
+ type='RandomErasing',
+ erase_prob=0.25,
+ mode='rand',
+ min_area_ratio=0.02,
+ max_area_ratio=0.3333333333333333,
+ fill_color=[103.53, 116.28, 123.675],
+ fill_std=[57.375, 57.12, 58.395]),
+ dict(type='PackInputs')
+]
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='ResizeEdge',
+ scale=256,
+ edge='short',
+ backend='pillow',
+ interpolation='bicubic'),
+ dict(type='CenterCrop', crop_size=224),
+ dict(type='PackInputs')
+]
+
+train_dataloader = dict(dataset=dict(pipeline=train_pipeline))
+val_dataloader = dict(dataset=dict(pipeline=test_pipeline))
+test_dataloader = val_dataloader
+
+# model settings
+model = dict(
+ backbone=dict(
+ img_size=224,
+ drop_path_rate=0.1,
+ stage_cfgs=dict(block_cfgs=dict(window_size=7)),
+ init_cfg=dict(type='Pretrained', checkpoint='', prefix='backbone.')))
+
+# optimizer settings
+optim_wrapper = dict(
+ type='AmpOptimWrapper',
+ optimizer=dict(type='AdamW', lr=5e-3, weight_decay=0.05),
+ clip_grad=dict(max_norm=5.0),
+ constructor='LearningRateDecayOptimWrapperConstructor',
+ paramwise_cfg=dict(
+ layer_decay_rate=0.9,
+ custom_keys={
+ '.norm': dict(decay_mult=0.0),
+ '.bias': dict(decay_mult=0.0),
+ '.absolute_pos_embed': dict(decay_mult=0.0),
+ '.relative_position_bias_table': dict(decay_mult=0.0)
+ }))
+
+# learning rate scheduler
+param_scheduler = [
+ dict(
+ type='LinearLR',
+ start_factor=2.5e-7 / 1.25e-3,
+ by_epoch=True,
+ begin=0,
+ end=20,
+ convert_to_iter_based=True),
+ dict(
+ type='CosineAnnealingLR',
+ T_max=80,
+ eta_min=2.5e-7 * 2048 / 512,
+ by_epoch=True,
+ begin=20,
+ end=100,
+ convert_to_iter_based=True)
+]
+
+# runtime settings
+train_cfg = dict(type='EpochBasedTrainLoop', max_epochs=100)
+val_cfg = dict()
+test_cfg = dict()
+
+default_hooks = dict(
+ # save checkpoint per epoch.
+ checkpoint=dict(type='CheckpointHook', interval=1, max_keep_ckpts=3),
+ logger=dict(type='LoggerHook', interval=100))
+
+randomness = dict(seed=0)
diff --git a/configs/simmim/benchmarks/swin-large-w14_8xb256-coslr-100e_in1k.py b/configs/simmim/benchmarks/swin-large-w14_8xb256-coslr-100e_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..a6eafd84d3c3f3224567747bcf645114286394f0
--- /dev/null
+++ b/configs/simmim/benchmarks/swin-large-w14_8xb256-coslr-100e_in1k.py
@@ -0,0 +1,105 @@
+_base_ = [
+ '../../_base_/models/swin_transformer/base_224.py',
+ '../../_base_/datasets/imagenet_bs256_swin_192.py',
+ '../../_base_/default_runtime.py'
+]
+
+# dataset settings
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='RandomResizedCrop',
+ scale=224,
+ backend='pillow',
+ interpolation='bicubic'),
+ dict(type='RandomFlip', prob=0.5, direction='horizontal'),
+ dict(
+ type='RandAugment',
+ policies='timm_increasing',
+ num_policies=2,
+ total_level=10,
+ magnitude_level=9,
+ magnitude_std=0.5,
+ hparams=dict(pad_val=[104, 116, 124], interpolation='bicubic')),
+ dict(
+ type='RandomErasing',
+ erase_prob=0.25,
+ mode='rand',
+ min_area_ratio=0.02,
+ max_area_ratio=0.3333333333333333,
+ fill_color=[103.53, 116.28, 123.675],
+ fill_std=[57.375, 57.12, 58.395]),
+ dict(type='PackInputs')
+]
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='ResizeEdge',
+ scale=256,
+ edge='short',
+ backend='pillow',
+ interpolation='bicubic'),
+ dict(type='CenterCrop', crop_size=224),
+ dict(type='PackInputs')
+]
+
+train_dataloader = dict(dataset=dict(pipeline=train_pipeline))
+val_dataloader = dict(dataset=dict(pipeline=test_pipeline))
+test_dataloader = val_dataloader
+
+# model settings
+model = dict(
+ backbone=dict(
+ arch='large',
+ img_size=224,
+ drop_path_rate=0.2,
+ stage_cfgs=dict(block_cfgs=dict(window_size=14)),
+ pad_small_map=True,
+ init_cfg=dict(type='Pretrained', checkpoint='', prefix='backbone.')),
+ head=dict(in_channels=1536))
+
+# optimizer settings
+optim_wrapper = dict(
+ type='AmpOptimWrapper',
+ optimizer=dict(type='AdamW', lr=5e-3, weight_decay=0.05),
+ clip_grad=dict(max_norm=5.0),
+ constructor='LearningRateDecayOptimWrapperConstructor',
+ paramwise_cfg=dict(
+ layer_decay_rate=0.7,
+ custom_keys={
+ '.norm': dict(decay_mult=0.0),
+ '.bias': dict(decay_mult=0.0),
+ '.absolute_pos_embed': dict(decay_mult=0.0),
+ '.relative_position_bias_table': dict(decay_mult=0.0)
+ }))
+
+# learning rate scheduler
+param_scheduler = [
+ dict(
+ type='LinearLR',
+ start_factor=2.5e-7 / 1.25e-3,
+ by_epoch=True,
+ begin=0,
+ end=20,
+ convert_to_iter_based=True),
+ dict(
+ type='CosineAnnealingLR',
+ T_max=100,
+ eta_min=1e-6,
+ by_epoch=True,
+ begin=20,
+ end=100,
+ convert_to_iter_based=True)
+]
+
+# runtime settings
+train_cfg = dict(type='EpochBasedTrainLoop', max_epochs=100)
+val_cfg = dict()
+test_cfg = dict()
+
+default_hooks = dict(
+ # save checkpoint per epoch.
+ checkpoint=dict(type='CheckpointHook', interval=1, max_keep_ckpts=3),
+ logger=dict(type='LoggerHook', interval=100))
+
+randomness = dict(seed=0)
diff --git a/configs/simmim/metafile.yml b/configs/simmim/metafile.yml
new file mode 100644
index 0000000000000000000000000000000000000000..19d9446c45c5f86315cc61be206430ea7bd97643
--- /dev/null
+++ b/configs/simmim/metafile.yml
@@ -0,0 +1,115 @@
+Collections:
+ - Name: SimMIM
+ Metadata:
+ Training Data: ImageNet-1k
+ Training Techniques:
+ - AdamW
+ Training Resources: 16x A100 GPUs
+ Architecture:
+ - Swin
+ Paper:
+ Title: 'SimMIM: A Simple Framework for Masked Image Modeling'
+ URL: https://arxiv.org/abs/2111.09886
+ README: configs/simmim/README.md
+
+Models:
+ - Name: simmim_swin-base-w6_8xb256-amp-coslr-100e_in1k-192px
+ Metadata:
+ Epochs: 100
+ Batch Size: 2048
+ FLOPs: 18832161792
+ Parameters: 89874104
+ Training Data: ImageNet-1k
+ In Collection: SimMIM
+ Results: null
+ Weights: https://download.openmmlab.com/mmselfsup/1.x/simmim/simmim_swin-base_8xb256-amp-coslr-100e_in1k-192/simmim_swin-base_8xb256-amp-coslr-100e_in1k-192_20220829-0e15782d.pth
+ Config: configs/simmim/simmim_swin-base-w6_8xb256-amp-coslr-100e_in1k-192px.py
+ Downstream:
+ - swin-base-w6_simmim-100e-pre_8xb256-coslr-100e_in1k-192px
+ - swin-base-w7_simmim-100e-pre_8xb256-coslr-100e_in1k
+ - Name: simmim_swin-base-w6_16xb128-amp-coslr-800e_in1k-192px
+ Metadata:
+ Epochs: 800
+ Batch Size: 2048
+ FLOPs: 18832161792
+ Parameters: 89874104
+ Training Data: ImageNet-1k
+ In Collection: SimMIM
+ Results: null
+ Weights: https://download.openmmlab.com/mmselfsup/1.x/simmim/simmim_swin-base_16xb128-amp-coslr-800e_in1k-192/simmim_swin-base_16xb128-amp-coslr-800e_in1k-192_20220916-a0e931ac.pth
+ Config: configs/simmim/simmim_swin-base-w6_16xb128-amp-coslr-800e_in1k-192px.py
+ Downstream:
+ - swin-base-w6_simmim-800e-pre_8xb256-coslr-100e_in1k-192px
+ - Name: simmim_swin-large-w12_16xb128-amp-coslr-800e_in1k-192px
+ Metadata:
+ Epochs: 800
+ Batch Size: 2048
+ FLOPs: 55849130496
+ Parameters: 199920372
+ Training Data: ImageNet-1k
+ In Collection: SimMIM
+ Results: null
+ Weights: https://download.openmmlab.com/mmselfsup/1.x/simmim/simmim_swin-large_16xb128-amp-coslr-800e_in1k-192/simmim_swin-large_16xb128-amp-coslr-800e_in1k-192_20220916-4ad216d3.pth
+ Config: configs/simmim/simmim_swin-large-w12_16xb128-amp-coslr-800e_in1k-192px.py
+ Downstream:
+ - swin-large-w14_simmim-800e-pre_8xb256-coslr-100e_in1k
+ - Name: swin-base-w6_simmim-100e-pre_8xb256-coslr-100e_in1k-192px
+ Metadata:
+ Epochs: 100
+ Batch Size: 2048
+ FLOPs: 11303976960
+ Parameters: 87750176
+ Training Data: ImageNet-1k
+ In Collection: SimMIM
+ Results:
+ - Task: Image Classification
+ Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 82.7
+ Weights: https://download.openmmlab.com/mmselfsup/1.x/simmim/simmim_swin-base_8xb256-amp-coslr-100e_in1k-192/swin-base_ft-8xb256-coslr-100e_in1k/swin-base_ft-8xb256-coslr-100e_in1k_20220829-9cf23aa1.pth
+ Config: configs/simmim/benchmarks/swin-base-w6_8xb256-coslr-100e_in1k-192px.py
+ - Name: swin-base-w7_simmim-100e-pre_8xb256-coslr-100e_in1k
+ Metadata:
+ Epochs: 100
+ Batch Size: 2048
+ FLOPs: 15466852352
+ Parameters: 87768224
+ Training Data: ImageNet-1k
+ In Collection: SimMIM
+ Results:
+ - Task: Image Classification
+ Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 83.5
+ Weights: null
+ Config: configs/simmim/benchmarks/swin-base-w7_8xb256-coslr-100e_in1k.py
+ - Name: swin-base-w6_simmim-800e-pre_8xb256-coslr-100e_in1k-192px
+ Metadata:
+ Epochs: 100
+ Batch Size: 2048
+ FLOPs: 15466852352
+ Parameters: 87768224
+ Training Data: ImageNet-1k
+ In Collection: SimMIM
+ Results:
+ - Task: Image Classification
+ Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 83.8
+ Weights: https://download.openmmlab.com/mmselfsup/1.x/simmim/simmim_swin-base_16xb128-amp-coslr-800e_in1k-192/swin-base_ft-8xb256-coslr-100e_in1k-224/swin-base_ft-8xb256-coslr-100e_in1k-224_20221208-155cc6e6.pth
+ Config: configs/simmim/benchmarks/swin-base-w7_8xb256-coslr-100e_in1k.py
+ - Name: swin-large-w14_simmim-800e-pre_8xb256-coslr-100e_in1k
+ Metadata:
+ Epochs: 100
+ Batch Size: 2048
+ FLOPs: 38853083136
+ Parameters: 196848316
+ Training Data: ImageNet-1k
+ In Collection: SimMIM
+ Results:
+ - Task: Image Classification
+ Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 84.8
+ Weights: https://download.openmmlab.com/mmselfsup/1.x/simmim/simmim_swin-large_16xb128-amp-coslr-800e_in1k-192/swin-large_ft-8xb256-coslr-ws14-100e_in1k-224/swin-large_ft-8xb256-coslr-ws14-100e_in1k-224_20220916-d4865790.pth
+ Config: configs/simmim/benchmarks/swin-large-w14_8xb256-coslr-100e_in1k.py
diff --git a/configs/simmim/simmim_swin-base-w6_16xb128-amp-coslr-100e_in1k-192px.py b/configs/simmim/simmim_swin-base-w6_16xb128-amp-coslr-100e_in1k-192px.py
new file mode 100644
index 0000000000000000000000000000000000000000..ed9dfdb85d6ebb0e87f18257a9320bc9166f4c5e
--- /dev/null
+++ b/configs/simmim/simmim_swin-base-w6_16xb128-amp-coslr-100e_in1k-192px.py
@@ -0,0 +1,4 @@
+_base_ = 'simmim_swin-base-w6_8xb256-amp-coslr-100e_in1k-192px.py'
+
+# dataset settings: 16 GPUs x 128 samples per GPU (total batch size 2048, as in the 8xb256 config)
+train_dataloader = dict(batch_size=128)
diff --git a/configs/simmim/simmim_swin-base-w6_16xb128-amp-coslr-800e_in1k-192px.py b/configs/simmim/simmim_swin-base-w6_16xb128-amp-coslr-800e_in1k-192px.py
new file mode 100644
index 0000000000000000000000000000000000000000..560714b7d6a74a22f6d8bb4358a0977fc73909e8
--- /dev/null
+++ b/configs/simmim/simmim_swin-base-w6_16xb128-amp-coslr-800e_in1k-192px.py
@@ -0,0 +1,64 @@
+_base_ = [
+ '../_base_/datasets/imagenet_bs256_simmim_192.py',
+ '../_base_/default_runtime.py',
+]
+
+# model settings
+model = dict(
+ type='SimMIM',
+ backbone=dict(
+ type='SimMIMSwinTransformer',
+ arch='base',
+ img_size=192,
+ stage_cfgs=dict(block_cfgs=dict(window_size=6))),
+ neck=dict(
+ type='SimMIMLinearDecoder', in_channels=128 * 2**3, encoder_stride=32),
+ head=dict(
+ type='SimMIMHead',
+ patch_size=4,
+ loss=dict(type='PixelReconstructionLoss', criterion='L1', channel=3)))
+
+# optimizer wrapper
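+# base lr 1e-4 is scaled linearly to the effective batch size (16 GPUs x 128 = 2048)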
+optim_wrapper = dict(
+ type='AmpOptimWrapper',
+ optimizer=dict(
+ type='AdamW',
+ lr=1e-4 * 2048 / 512,
+ betas=(0.9, 0.999),
+ weight_decay=0.05),
+ clip_grad=dict(max_norm=5.0),
+ paramwise_cfg=dict(
+ custom_keys={
+ 'norm': dict(decay_mult=0.0),
+ 'bias': dict(decay_mult=0.0),
+ 'absolute_pos_embed': dict(decay_mult=0.),
+ 'relative_position_bias_table': dict(decay_mult=0.)
+ }))
+
+# learning rate scheduler
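+# 10-epoch linear warmup, then MultiStepLR drops the lr once at epoch 700
+# (with the default gamma of 0.1)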
+param_scheduler = [
+ dict(
+ type='LinearLR',
+ start_factor=5e-7 / 1e-4,
+ by_epoch=True,
+ begin=0,
+ end=10,
+ convert_to_iter_based=True),
+ dict(
+ type='MultiStepLR',
+ milestones=[700],
+ by_epoch=True,
+ begin=10,
+ end=800,
+ convert_to_iter_based=True)
+]
+
+# runtime
+train_cfg = dict(type='EpochBasedTrainLoop', max_epochs=800)
+default_hooks = dict(
+ # only keeps the latest 3 checkpoints
+ checkpoint=dict(type='CheckpointHook', interval=10, max_keep_ckpts=3))
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR
+# based on the actual training batch size.
+auto_scale_lr = dict(base_batch_size=2048)
diff --git a/configs/simmim/simmim_swin-base-w6_8xb256-amp-coslr-100e_in1k-192px.py b/configs/simmim/simmim_swin-base-w6_8xb256-amp-coslr-100e_in1k-192px.py
new file mode 100644
index 0000000000000000000000000000000000000000..a0be14486a3e29b14b78e507108f57d803404b8f
--- /dev/null
+++ b/configs/simmim/simmim_swin-base-w6_8xb256-amp-coslr-100e_in1k-192px.py
@@ -0,0 +1,65 @@
+_base_ = [
+ '../_base_/datasets/imagenet_bs256_simmim_192.py',
+ '../_base_/default_runtime.py',
+]
+
+# model settings
+model = dict(
+ type='SimMIM',
+ backbone=dict(
+ type='SimMIMSwinTransformer',
+ arch='base',
+ img_size=192,
+ stage_cfgs=dict(block_cfgs=dict(window_size=6))),
+ neck=dict(
+ type='SimMIMLinearDecoder', in_channels=128 * 2**3, encoder_stride=32),
+ head=dict(
+ type='SimMIMHead',
+ patch_size=4,
+ loss=dict(type='PixelReconstructionLoss', criterion='L1', channel=3)))
+
+# optimizer wrapper
+optim_wrapper = dict(
+ type='AmpOptimWrapper',
+ optimizer=dict(
+ type='AdamW',
+ lr=2e-4 * 2048 / 512,
+ betas=(0.9, 0.999),
+ weight_decay=0.05),
+ clip_grad=dict(max_norm=5.0),
+ paramwise_cfg=dict(
+ custom_keys={
+ 'norm': dict(decay_mult=0.0),
+ 'bias': dict(decay_mult=0.0),
+ 'absolute_pos_embed': dict(decay_mult=0.),
+ 'relative_position_bias_table': dict(decay_mult=0.)
+ }))
+
+# learning rate scheduler
+param_scheduler = [
+ dict(
+ type='LinearLR',
+ start_factor=1e-6 / 2e-4,
+ by_epoch=True,
+ begin=0,
+ end=10,
+ convert_to_iter_based=True),
+ dict(
+ type='CosineAnnealingLR',
+ T_max=90,
+ eta_min=1e-5 * 2048 / 512,
+ by_epoch=True,
+ begin=10,
+ end=100,
+ convert_to_iter_based=True)
+]
+
+# runtime
+train_cfg = dict(type='EpochBasedTrainLoop', max_epochs=100)
+default_hooks = dict(
+ # only keeps the latest 3 checkpoints
+ checkpoint=dict(type='CheckpointHook', interval=10, max_keep_ckpts=3))
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR
+# based on the actual training batch size.
+auto_scale_lr = dict(base_batch_size=2048)
diff --git a/configs/simmim/simmim_swin-large-w12_16xb128-amp-coslr-800e_in1k-192px.py b/configs/simmim/simmim_swin-large-w12_16xb128-amp-coslr-800e_in1k-192px.py
new file mode 100644
index 0000000000000000000000000000000000000000..0563023bd796e640c5c4caff2b9dc9bc555227c4
--- /dev/null
+++ b/configs/simmim/simmim_swin-large-w12_16xb128-amp-coslr-800e_in1k-192px.py
@@ -0,0 +1,65 @@
+_base_ = [
+ '../_base_/datasets/imagenet_bs256_simmim_192.py',
+ '../_base_/default_runtime.py',
+]
+
+# model settings
+model = dict(
+ type='SimMIM',
+ backbone=dict(
+ type='SimMIMSwinTransformer',
+ arch='large',
+ img_size=192,
+ stage_cfgs=dict(block_cfgs=dict(window_size=12)),
+ pad_small_map=True),
+ neck=dict(
+ type='SimMIMLinearDecoder', in_channels=192 * 2**3, encoder_stride=32),
+ head=dict(
+ type='SimMIMHead',
+ patch_size=4,
+ loss=dict(type='PixelReconstructionLoss', criterion='L1', channel=3)))
+
+# optimizer wrapper
+optim_wrapper = dict(
+ type='AmpOptimWrapper',
+ optimizer=dict(
+ type='AdamW',
+ lr=1e-4 * 2048 / 512,
+ betas=(0.9, 0.999),
+ weight_decay=0.05),
+ clip_grad=dict(max_norm=5.0),
+ paramwise_cfg=dict(
+ custom_keys={
+ 'norm': dict(decay_mult=0.0),
+ 'bias': dict(decay_mult=0.0),
+ 'absolute_pos_embed': dict(decay_mult=0.),
+ 'relative_position_bias_table': dict(decay_mult=0.)
+ }))
+
+# learning rate scheduler
+param_scheduler = [
+ dict(
+ type='LinearLR',
+ start_factor=5e-7 / 1e-4,
+ by_epoch=True,
+ begin=0,
+ end=10,
+ convert_to_iter_based=True),
+ dict(
+ type='MultiStepLR',
+ milestones=[700],
+ by_epoch=True,
+ begin=10,
+ end=800,
+ convert_to_iter_based=True)
+]
+
+# runtime
+train_cfg = dict(type='EpochBasedTrainLoop', max_epochs=800)
+default_hooks = dict(
+ # only keeps the latest 3 checkpoints
+ checkpoint=dict(type='CheckpointHook', interval=10, max_keep_ckpts=3))
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR
+# based on the actual training batch size.
+auto_scale_lr = dict(base_batch_size=2048)
diff --git a/configs/simsiam/README.md b/configs/simsiam/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..117e45bf7bec09a86558d3372663440d5859155f
--- /dev/null
+++ b/configs/simsiam/README.md
@@ -0,0 +1,87 @@
+# SimSiam
+
+> [Exploring simple siamese representation learning](https://arxiv.org/abs/2011.10566)
+
+
+
+## Abstract
+
+Siamese networks have become a common structure in various recent models for unsupervised visual representation learning. These models maximize the similarity between two augmentations of one image, subject to certain conditions for avoiding collapsing solutions. In this paper, we report surprising empirical results that simple Siamese networks can learn meaningful representations even using none of the following: (i) negative sample pairs, (ii) large batches, (iii) momentum encoders. Our experiments show that collapsing solutions do exist for the loss and structure, but a stop-gradient operation plays an essential role in preventing collapsing. We provide a hypothesis on the implication of stop-gradient, and further show proof-of-concept experiments verifying it. Our “SimSiam” method achieves competitive results on ImageNet and downstream tasks. We hope this simple baseline will motivate people to rethink the roles of Siamese architectures for unsupervised representation learning.
+
+
+

+
+
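+The decisive ingredient discussed above is the stop-gradient: each branch's prediction `p` is matched to the detached projection `z` of the other branch with a negative cosine similarity, and the two directions are averaged. Below is a minimal sketch of that symmetric loss, independent of the `LatentPredictHead` used by the configs in this folder.
+
+```python
+import torch
+import torch.nn.functional as F
+
+
+def simsiam_loss(p1, p2, z1, z2):
+    """Symmetric negative cosine similarity with stop-gradient on z."""
+
+    def d(p, z):
+        # stop-gradient: the projection targets are treated as constants
+        return -F.cosine_similarity(p, z.detach(), dim=1).mean()
+
+    return 0.5 * d(p1, z2) + 0.5 * d(p2, z1)
+
+
+p1, p2, z1, z2 = (torch.randn(4, 2048) for _ in range(4))
+print(simsiam_loss(p1, p2, z1, z2).item())
+```
+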
+## How to use it?
+
+
+
+**Predict image**
+
+```python
+from mmpretrain import inference_model
+
+predict = inference_model('resnet50_simsiam-100e-pre_8xb512-linear-coslr-90e_in1k', 'demo/bird.JPEG')
+print(predict['pred_class'])
+print(predict['pred_score'])
+```
+
+**Use the model**
+
+```python
+import torch
+from mmpretrain import get_model
+
+model = get_model('simsiam_resnet50_8xb32-coslr-100e_in1k', pretrained=True)
+inputs = torch.rand(1, 3, 224, 224)
+out = model(inputs)
+print(type(out))
+# To extract features.
+feats = model.extract_feat(inputs)
+print(type(feats))
+```
+
+**Train/Test Command**
+
+Prepare your dataset according to the [docs](https://mmpretrain.readthedocs.io/en/latest/user_guides/dataset_prepare.html#prepare-dataset).
+
+Train:
+
+```shell
+python tools/train.py configs/simsiam/simsiam_resnet50_8xb32-coslr-100e_in1k.py
+```
+
+Test:
+
+```shell
+python tools/test.py configs/simsiam/benchmarks/resnet50_8xb512-linear-coslr-90e_in1k.py https://download.openmmlab.com/mmselfsup/1.x/simsiam/simsiam_resnet50_8xb32-coslr-100e_in1k/resnet50_linear-8xb512-coslr-90e_in1k/resnet50_linear-8xb512-coslr-90e_in1k_20220825-f53ba400.pth
+```
+
+
+
+## Models and results
+
+### Pretrained models
+
+| Model | Params (M) | Flops (G) | Config | Download |
+| :--------------------------------------- | :--------: | :-------: | :-------------------------------------------------: | :----------------------------------------------------------------------------------------: |
+| `simsiam_resnet50_8xb32-coslr-100e_in1k` | 38.20 | 4.11 | [config](simsiam_resnet50_8xb32-coslr-100e_in1k.py) | [model](https://download.openmmlab.com/mmselfsup/1.x/simsiam/simsiam_resnet50_8xb32-coslr-100e_in1k/simsiam_resnet50_8xb32-coslr-100e_in1k_20220825-d07cb2e6.pth) \| [log](https://download.openmmlab.com/mmselfsup/1.x/simsiam/simsiam_resnet50_8xb32-coslr-100e_in1k/simsiam_resnet50_8xb32-coslr-100e_in1k_20220825-d07cb2e6.json) |
+| `simsiam_resnet50_8xb32-coslr-200e_in1k` | 38.20 | 4.11 | [config](simsiam_resnet50_8xb32-coslr-200e_in1k.py) | [model](https://download.openmmlab.com/mmselfsup/1.x/simsiam/simsiam_resnet50_8xb32-coslr-200e_in1k/simsiam_resnet50_8xb32-coslr-200e_in1k_20220825-efe91299.pth) \| [log](https://download.openmmlab.com/mmselfsup/1.x/simsiam/simsiam_resnet50_8xb32-coslr-200e_in1k/simsiam_resnet50_8xb32-coslr-200e_in1k_20220825-efe91299.json) |
+
+### Image Classification on ImageNet-1k
+
+| Model | Pretrain | Params (M) | Flops (G) | Top-1 (%) | Config | Download |
+| :---------------------------------------- | :------------------------------------------: | :--------: | :-------: | :-------: | :----------------------------------------: | :-------------------------------------------: |
+| `resnet50_simsiam-100e-pre_8xb512-linear-coslr-90e_in1k` | [SIMSIAM 100-Epochs](https://download.openmmlab.com/mmselfsup/1.x/simsiam/simsiam_resnet50_8xb32-coslr-100e_in1k/simsiam_resnet50_8xb32-coslr-100e_in1k_20220825-d07cb2e6.pth) | 25.56 | 4.11 | 68.30 | [config](benchmarks/resnet50_8xb512-linear-coslr-90e_in1k.py) | [model](https://download.openmmlab.com/mmselfsup/1.x/simsiam/simsiam_resnet50_8xb32-coslr-100e_in1k/resnet50_linear-8xb512-coslr-90e_in1k/resnet50_linear-8xb512-coslr-90e_in1k_20220825-f53ba400.pth) \| [log](https://download.openmmlab.com/mmselfsup/1.x/simsiam/simsiam_resnet50_8xb32-coslr-100e_in1k/resnet50_linear-8xb512-coslr-90e_in1k/resnet50_linear-8xb512-coslr-90e_in1k_20220825-f53ba400.json) |
+| `resnet50_simsiam-200e-pre_8xb512-linear-coslr-90e_in1k` | [SIMSIAM 200-Epochs](https://download.openmmlab.com/mmselfsup/1.x/simsiam/simsiam_resnet50_8xb32-coslr-200e_in1k/simsiam_resnet50_8xb32-coslr-200e_in1k_20220825-efe91299.pth) | 25.56 | 4.11 | 69.80 | [config](benchmarks/resnet50_8xb512-linear-coslr-90e_in1k.py) | [model](https://download.openmmlab.com/mmselfsup/1.x/simsiam/simsiam_resnet50_8xb32-coslr-200e_in1k/resnet50_linear-8xb512-coslr-90e_in1k/resnet50_linear-8xb512-coslr-90e_in1k_20220825-519b5135.pth) \| [log](https://download.openmmlab.com/mmselfsup/1.x/simsiam/simsiam_resnet50_8xb32-coslr-200e_in1k/resnet50_linear-8xb512-coslr-90e_in1k/resnet50_linear-8xb512-coslr-90e_in1k_20220825-519b5135.json) |
+
+## Citation
+
+```bibtex
+@inproceedings{chen2021exploring,
+ title={Exploring simple siamese representation learning},
+ author={Chen, Xinlei and He, Kaiming},
+ booktitle={CVPR},
+ year={2021}
+}
+```
diff --git a/configs/simsiam/benchmarks/resnet50_8xb512-linear-coslr-90e_in1k.py b/configs/simsiam/benchmarks/resnet50_8xb512-linear-coslr-90e_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..2b5074c082b8b6fb36bd3c6711b60bab6394b4ce
--- /dev/null
+++ b/configs/simsiam/benchmarks/resnet50_8xb512-linear-coslr-90e_in1k.py
@@ -0,0 +1,18 @@
+_base_ = [
+ '../../_base_/models/resnet50.py',
+ '../../_base_/datasets/imagenet_bs32_pil_resize.py',
+ '../../_base_/schedules/imagenet_lars_coslr_90e.py',
+ '../../_base_/default_runtime.py',
+]
+
+model = dict(
+ backbone=dict(
+ frozen_stages=4,
+ init_cfg=dict(type='Pretrained', checkpoint='', prefix='backbone.')))
+
+# dataset summary
+train_dataloader = dict(batch_size=512)
+
+# runtime settings
+default_hooks = dict(
+ checkpoint=dict(type='CheckpointHook', interval=10, max_keep_ckpts=3))
diff --git a/configs/simsiam/metafile.yml b/configs/simsiam/metafile.yml
new file mode 100644
index 0000000000000000000000000000000000000000..40f6706511cf6cf49f8b65153ffd575348abeeca
--- /dev/null
+++ b/configs/simsiam/metafile.yml
@@ -0,0 +1,72 @@
+Collections:
+ - Name: SimSiam
+ Metadata:
+ Training Data: ImageNet-1k
+ Training Techniques:
+ - SGD with Momentum
+ - Weight Decay
+ Training Resources: 8x V100 GPUs
+ Architecture:
+ - ResNet
+ Paper:
+ Title: Exploring simple siamese representation learning
+ URL: https://arxiv.org/abs/2011.10566
+ README: configs/simsiam/README.md
+
+Models:
+ - Name: simsiam_resnet50_8xb32-coslr-100e_in1k
+ Metadata:
+ Epochs: 100
+ Batch Size: 256
+ FLOPs: 4109364224
+ Parameters: 38199360
+ Training Data: ImageNet-1k
+ In Collection: SimSiam
+ Results: null
+ Weights: https://download.openmmlab.com/mmselfsup/1.x/simsiam/simsiam_resnet50_8xb32-coslr-100e_in1k/simsiam_resnet50_8xb32-coslr-100e_in1k_20220825-d07cb2e6.pth
+ Config: configs/simsiam/simsiam_resnet50_8xb32-coslr-100e_in1k.py
+ Downstream:
+ - resnet50_simsiam-100e-pre_8xb512-linear-coslr-90e_in1k
+ - Name: simsiam_resnet50_8xb32-coslr-200e_in1k
+ Metadata:
+ Epochs: 200
+ Batch Size: 256
+ FLOPs: 4109364224
+ Parameters: 38199360
+ Training Data: ImageNet-1k
+ In Collection: SimSiam
+ Results: null
+ Weights: https://download.openmmlab.com/mmselfsup/1.x/simsiam/simsiam_resnet50_8xb32-coslr-200e_in1k/simsiam_resnet50_8xb32-coslr-200e_in1k_20220825-efe91299.pth
+ Config: configs/simsiam/simsiam_resnet50_8xb32-coslr-200e_in1k.py
+ Downstream:
+ - resnet50_simsiam-200e-pre_8xb512-linear-coslr-90e_in1k
+ - Name: resnet50_simsiam-100e-pre_8xb512-linear-coslr-90e_in1k
+ Metadata:
+ Epochs: 90
+ Batch Size: 4096
+ FLOPs: 4109464576
+ Parameters: 25557032
+ Training Data: ImageNet-1k
+ In Collection: SimSiam
+ Results:
+ - Task: Image Classification
+ Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 68.3
+ Weights: https://download.openmmlab.com/mmselfsup/1.x/simsiam/simsiam_resnet50_8xb32-coslr-100e_in1k/resnet50_linear-8xb512-coslr-90e_in1k/resnet50_linear-8xb512-coslr-90e_in1k_20220825-f53ba400.pth
+ Config: configs/simsiam/benchmarks/resnet50_8xb512-linear-coslr-90e_in1k.py
+ - Name: resnet50_simsiam-200e-pre_8xb512-linear-coslr-90e_in1k
+ Metadata:
+ Epochs: 90
+ Batch Size: 4096
+ FLOPs: 4109464576
+ Parameters: 25557032
+ Training Data: ImageNet-1k
+ In Collection: SimSiam
+ Results:
+ - Task: Image Classification
+ Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 69.8
+ Weights: https://download.openmmlab.com/mmselfsup/1.x/simsiam/simsiam_resnet50_8xb32-coslr-200e_in1k/resnet50_linear-8xb512-coslr-90e_in1k/resnet50_linear-8xb512-coslr-90e_in1k_20220825-519b5135.pth
+ Config: configs/simsiam/benchmarks/resnet50_8xb512-linear-coslr-90e_in1k.py
diff --git a/configs/simsiam/simsiam_resnet50_8xb32-coslr-100e_in1k.py b/configs/simsiam/simsiam_resnet50_8xb32-coslr-100e_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..ad19af6acaa530f0a0c3120034fa836cec965642
--- /dev/null
+++ b/configs/simsiam/simsiam_resnet50_8xb32-coslr-100e_in1k.py
@@ -0,0 +1,58 @@
+_base_ = [
+ '../_base_/datasets/imagenet_bs32_mocov2.py',
+ '../_base_/schedules/imagenet_sgd_coslr_200e.py',
+ '../_base_/default_runtime.py',
+]
+
+# model settings
+model = dict(
+ type='SimSiam',
+ backbone=dict(
+ type='ResNet',
+ depth=50,
+ norm_cfg=dict(type='SyncBN'),
+ zero_init_residual=True),
+ neck=dict(
+ type='NonLinearNeck',
+ in_channels=2048,
+ hid_channels=2048,
+ out_channels=2048,
+ num_layers=3,
+ with_last_bn_affine=False,
+ with_avg_pool=True),
+ head=dict(
+ type='LatentPredictHead',
+ loss=dict(type='CosineSimilarityLoss'),
+ predictor=dict(
+ type='NonLinearNeck',
+ in_channels=2048,
+ hid_channels=512,
+ out_channels=2048,
+ with_avg_pool=False,
+ with_last_bn=False,
+ with_last_bias=True)),
+)
+
+# optimizer
+# set base learning rate
+lr = 0.05
+optim_wrapper = dict(
+ type='OptimWrapper',
+ optimizer=dict(type='SGD', lr=lr, weight_decay=1e-4, momentum=0.9),
+ paramwise_cfg=dict(custom_keys={'predictor': dict(fix_lr=True)}))
+
+# learning rate scheduler
+param_scheduler = [
+ dict(type='CosineAnnealingLR', T_max=100, by_epoch=True, begin=0, end=100)
+]
+
+# runtime settings
+train_cfg = dict(type='EpochBasedTrainLoop', max_epochs=100)
+default_hooks = dict(
+ # only keeps the latest 3 checkpoints
+ checkpoint=dict(type='CheckpointHook', interval=10, max_keep_ckpts=3))
+
+# additional hooks
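+# SimSiamHook keeps the predictor's learning rate fixed at the base value
+# throughout training (fix_pred_lr), as in the original SimSiam recipe.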
+custom_hooks = [
+ dict(type='SimSiamHook', priority='HIGH', fix_pred_lr=True, lr=lr)
+]
diff --git a/configs/simsiam/simsiam_resnet50_8xb32-coslr-200e_in1k.py b/configs/simsiam/simsiam_resnet50_8xb32-coslr-200e_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..fa3b2bbf5eb0b2f6c9b6907e78d189c13ea00cae
--- /dev/null
+++ b/configs/simsiam/simsiam_resnet50_8xb32-coslr-200e_in1k.py
@@ -0,0 +1,52 @@
+_base_ = [
+ '../_base_/datasets/imagenet_bs32_mocov2.py',
+ '../_base_/schedules/imagenet_sgd_coslr_200e.py',
+ '../_base_/default_runtime.py',
+]
+
+# model settings
+model = dict(
+ type='SimSiam',
+ backbone=dict(
+ type='ResNet',
+ depth=50,
+ norm_cfg=dict(type='SyncBN'),
+ zero_init_residual=True),
+ neck=dict(
+ type='NonLinearNeck',
+ in_channels=2048,
+ hid_channels=2048,
+ out_channels=2048,
+ num_layers=3,
+ with_last_bn_affine=False,
+ with_avg_pool=True),
+ head=dict(
+ type='LatentPredictHead',
+ loss=dict(type='CosineSimilarityLoss'),
+ predictor=dict(
+ type='NonLinearNeck',
+ in_channels=2048,
+ hid_channels=512,
+ out_channels=2048,
+ with_avg_pool=False,
+ with_last_bn=False,
+ with_last_bias=True)),
+)
+
+# optimizer
+# set base learning rate
+lr = 0.05
+optim_wrapper = dict(
+ type='OptimWrapper',
+ optimizer=dict(type='SGD', lr=lr, weight_decay=1e-4, momentum=0.9),
+ paramwise_cfg=dict(custom_keys={'predictor': dict(fix_lr=True)}))
+
+# runtime settings
+default_hooks = dict(
+ # only keeps the latest 3 checkpoints
+ checkpoint=dict(type='CheckpointHook', interval=10, max_keep_ckpts=3))
+
+# additional hooks
+custom_hooks = [
+ dict(type='SimSiamHook', priority='HIGH', fix_pred_lr=True, lr=lr)
+]
diff --git a/configs/spark/README.md b/configs/spark/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..60f510e959dacac9fa48a5e0495be63e4fc1a03a
--- /dev/null
+++ b/configs/spark/README.md
@@ -0,0 +1,87 @@
+# SparK
+
+> [Designing BERT for Convolutional Networks: Sparse and Hierarchical Masked Modeling](https://arxiv.org/abs/2301.03580)
+
+
+
+## Abstract
+
+We identify and overcome two key obstacles in extending the success of BERT-style pre-training, or the masked image modeling, to convolutional networks (convnets): (i) convolution operation cannot handle irregular, random-masked input images; (ii) the single-scale nature of BERT pre-training is inconsistent with convnet's hierarchical structure. For (i), we treat unmasked pixels as sparse voxels of 3D point clouds and use sparse convolution to encode. This is the first use of sparse convolution for 2D masked modeling. For (ii), we develop a hierarchical decoder to reconstruct images from multi-scale encoded features. Our method called Sparse masKed modeling (SparK) is general: it can be used directly on any convolutional model without backbone modifications. We validate it on both classical (ResNet) and modern (ConvNeXt) models: on three downstream tasks, it surpasses both state-of-the-art contrastive learning and transformer-based masked modeling by similarly large margins (around +1.0%). Improvements on object detection and instance segmentation are more substantial (up to +3.5%), verifying the strong transferability of features learned. We also find its favorable scaling behavior by observing more gains on larger models. All this evidence reveals a promising future of generative pre-training on convnets. Codes and models are released at https://github.com/keyu-tian/SparK.
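+
+The core idea, masking at the resolution of the final feature map and letting the sparse encoder compute only on the visible patches, can be illustrated with a minimal PyTorch sketch (hypothetical code, not the implementation in this repo; the input size, 32x downsampling factor and `mask_ratio=0.6` mirror the configs below):
+
+```python
+import torch
+
+
+def build_patch_mask(input_size=224, downsample_ratio=32, mask_ratio=0.6):
+    """Draw a random patch mask on the final feature map, then upsample it to pixels."""
+    fmap = input_size // downsample_ratio                # 224 // 32 = 7
+    num_visible = round(fmap * fmap * (1 - mask_ratio))  # patches kept visible
+    keep = torch.randperm(fmap * fmap)[:num_visible]
+    mask = torch.zeros(fmap * fmap, dtype=torch.bool)
+    mask[keep] = True                                    # True = visible patch
+    mask = mask.view(1, 1, fmap, fmap)
+    return mask.repeat_interleave(downsample_ratio, dim=2).repeat_interleave(
+        downsample_ratio, dim=3)
+
+
+img = torch.rand(1, 3, 224, 224)
+mask = build_patch_mask()
+visible = img * mask              # a sparse encoder only computes on these pixels
+print(mask.float().mean())        # about 0.4 of the image stays visible
+```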
+
+
+

+
+
+## How to use it?
+
+
+
+**Predict image**
+
+```python
+from mmpretrain import inference_model
+
+predict = inference_model('resnet50_spark-pre_300e_in1k', 'demo/bird.JPEG')
+print(predict['pred_class'])
+print(predict['pred_score'])
+```
+
+**Use the model**
+
+```python
+import torch
+from mmpretrain import get_model
+
+model = get_model('spark_sparse-resnet50_800e_in1k', pretrained=True)
+inputs = torch.rand(1, 3, 224, 224)
+out = model(inputs)
+print(type(out))
+# To extract features.
+feats = model.extract_feat(inputs)
+print(type(feats))
+```
+
+**Train/Test Command**
+
+Prepare your dataset according to the [docs](https://mmpretrain.readthedocs.io/en/latest/user_guides/dataset_prepare.html#prepare-dataset).
+
+Train:
+
+```shell
+python tools/train.py configs/spark/spark_sparse-resnet50_8xb512-amp-coslr-800e_in1k.py
+```
+
+Test:
+
+```shell
+python tools/test.py configs/spark/benchmarks/resnet50_8xb256-coslr-300e_in1k.py https://download.openmmlab.com/mmpretrain/v1.0/spark/spark_sparse-resnet50_8xb512-amp-coslr-800e_in1k/resnet50_8xb256-coslr-300e_in1k/resnet50_8xb256-coslr-300e_in1k_20230612-f86aab51.pth
+```
+
+
+
+## Models and results
+
+### Pretrained models
+
+| Model | Params (M) | Flops (G) | Config | Download |
+| :--------------------------------------- | :--------: | :-------: | :-------------------------------------------------------------------: | :----------------------------------------------------------------------: |
+| `spark_sparse-resnet50_800e_in1k` | 37.97 | 4.10 | [config](spark_sparse-resnet50_8xb512-amp-coslr-800e_in1k.py) | [model](https://download.openmmlab.com/mmpretrain/v1.0/spark/spark_sparse-resnet50_8xb512-amp-coslr-800e_in1k/spark_sparse-resnet50_8xb512-amp-coslr-800e_in1k_20230612-e403c28f.pth) \| [log](https://download.openmmlab.com/mmpretrain/v1.0/spark/spark_sparse-resnet50_8xb512-amp-coslr-800e_in1k/spark_sparse-resnet50_8xb512-amp-coslr-800e_in1k_20230612-e403c28f.json) |
+| `spark_sparse-convnextv2-tiny_800e_in1k` | 39.73 | 4.47 | [config](spark_sparse-convnextv2-tiny_16xb256-amp-coslr-800e_in1k.py) | [model](https://download.openmmlab.com/mmpretrain/v1.0/spark/spark_sparse-convnextv2-tiny_16xb256-amp-coslr-800e_in1k/spark_sparse-convnextv2-tiny_16xb256-amp-coslr-800e_in1k_20230612-b0ea712e.pth) \| [log](https://download.openmmlab.com/mmpretrain/v1.0/spark/spark_sparse-convnextv2-tiny_16xb256-amp-coslr-800e_in1k/spark_sparse-convnextv2-tiny_16xb256-amp-coslr-800e_in1k_20230612-b0ea712e.json) |
+
+### Image Classification on ImageNet-1k
+
+| Model | Pretrain | Params (M) | Flops (G) | Top-1 (%) | Top-5 (%) | Config | Download |
+| :------------------------------------ | :----------------------------------------: | :--------: | :-------: | :-------: | :-------: | :---------------------------------------: | :-----------------------------------------: |
+| `resnet50_spark-pre_300e_in1k` | [SPARK](https://download.openmmlab.com/mmpretrain/v1.0/spark/spark_sparse-resnet50_8xb512-amp-coslr-800e_in1k/spark_sparse-resnet50_8xb512-amp-coslr-800e_in1k_20230612-e403c28f.pth) | 23.52 | 1.31 | 80.10 | 94.90 | [config](benchmarks/resnet50_8xb256-coslr-300e_in1k.py) | [model](https://download.openmmlab.com/mmpretrain/v1.0/spark/spark_sparse-resnet50_8xb512-amp-coslr-800e_in1k/resnet50_8xb256-coslr-300e_in1k/resnet50_8xb256-coslr-300e_in1k_20230612-f86aab51.pth) \| [log](https://download.openmmlab.com/mmpretrain/v1.0/spark/spark_sparse-resnet50_8xb512-amp-coslr-800e_in1k/resnet50_8xb256-coslr-300e_in1k/resnet50_8xb256-coslr-300e_in1k_20230612-f86aab51.json) |
+| `convnextv2-tiny_spark-pre_300e_in1k` | [SPARK](https://download.openmmlab.com/mmpretrain/v1.0/spark/spark_sparse-convnextv2-tiny_16xb256-amp-coslr-800e_in1k/spark_sparse-convnextv2-tiny_16xb256-amp-coslr-800e_in1k_20230612-b0ea712e.pth) | 28.64 | 4.47 | 82.80 | 96.30 | [config](benchmarks/convnextv2-tiny_8xb256-coslr-300e_in1k.py) | [model](https://download.openmmlab.com/mmpretrain/v1.0/spark//spark_sparse-convnextv2-tiny_16xb256-amp-coslr-800e_in1k/convnextv2-tiny_8xb256-coslr-300e_in1k/convnextv2-tiny_8xb256-coslr-300e_in1k_20230612-ffc78743.pth) \| [log](https://download.openmmlab.com/mmpretrain/v1.0/spark//spark_sparse-convnextv2-tiny_16xb256-amp-coslr-800e_in1k/convnextv2-tiny_8xb256-coslr-300e_in1k/convnextv2-tiny_8xb256-coslr-300e_in1k_20230612-ffc78743.json) |
+
+## Citation
+
+```bibtex
+@article{tian2023designing,
+ author = {Keyu Tian and Yi Jiang and Qishuai Diao and Chen Lin and Liwei Wang and Zehuan Yuan},
+ title = {Designing BERT for Convolutional Networks: Sparse and Hierarchical Masked Modeling},
+ journal = {arXiv:2301.03580},
+ year = {2023},
+}
+```
diff --git a/configs/spark/benchmarks/convnextv2-tiny_8xb256-coslr-300e_in1k.py b/configs/spark/benchmarks/convnextv2-tiny_8xb256-coslr-300e_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..95ef81f16a8d1173702ccfe3313f1e85bdd561ef
--- /dev/null
+++ b/configs/spark/benchmarks/convnextv2-tiny_8xb256-coslr-300e_in1k.py
@@ -0,0 +1,122 @@
+_base_ = [
+ '../../_base_/datasets/imagenet_bs64_swin_224.py',
+ '../../_base_/default_runtime.py',
+]
+
+data_preprocessor = dict(
+ num_classes=1000,
+ # RGB format normalization parameters
+ mean=[123.675, 116.28, 103.53],
+ std=[58.395, 57.12, 57.375],
+ # convert image from BGR to RGB
+ to_rgb=True,
+)
+
+bgr_mean = data_preprocessor['mean'][::-1]
+bgr_std = data_preprocessor['std'][::-1]
+
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='RandomResizedCrop',
+ scale=224,
+ backend='pillow',
+ interpolation='bicubic'),
+ dict(type='RandomFlip', prob=0.5, direction='horizontal'),
+ dict(type='NumpyToPIL', to_rgb=True),
+ dict(
+ type='torchvision/TrivialAugmentWide',
+ num_magnitude_bins=31,
+ interpolation='bicubic',
+ fill=None),
+ dict(type='PILToNumpy', to_bgr=True),
+ dict(
+ type='RandomErasing',
+ erase_prob=0.25,
+ mode='rand',
+ min_area_ratio=0.02,
+ max_area_ratio=1 / 3,
+ fill_color=bgr_mean,
+ fill_std=bgr_std),
+ dict(type='PackInputs'),
+]
+
+train_dataloader = dict(
+ dataset=dict(pipeline=train_pipeline),
+ sampler=dict(type='RepeatAugSampler', shuffle=True),
+)
+
+# Model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='ConvNeXt',
+ arch='tiny',
+ drop_path_rate=0.1,
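+        # zero layer scale and use_grn=True switch this ConvNeXt backbone to ConvNeXt V2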
+ layer_scale_init_value=0.,
+ use_grn=True,
+ init_cfg=dict(type='Pretrained', checkpoint='', prefix='backbone.')),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=768,
+ loss=dict(
+ type='LabelSmoothLoss', label_smooth_val=0.1, mode='original'),
+ init_cfg=dict(type='TruncNormal', layer='Linear', std=.02, bias=0.),
+ ),
+ train_cfg=dict(augments=[
+ dict(type='Mixup', alpha=0.8),
+ dict(type='CutMix', alpha=1.0),
+ ]),
+)
+
+custom_hooks = [
+ dict(
+ type='EMAHook',
+ momentum=1e-4,
+ evaluate_on_origin=True,
+ priority='ABOVE_NORMAL')
+]
+
+# schedule settings
+# optimizer
+optim_wrapper = dict(
+ optimizer=dict(
+ type='AdamW', lr=3.2e-3, betas=(0.9, 0.999), weight_decay=0.05),
+ constructor='LearningRateDecayOptimWrapperConstructor',
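+    # layer-wise lr decay: each shallower layer has its lr scaled by a further 0.7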
+ paramwise_cfg=dict(
+ layer_decay_rate=0.7,
+ norm_decay_mult=0.0,
+ bias_decay_mult=0.0,
+ flat_decay_mult=0.0))
+
+# learning policy
+param_scheduler = [
+ # warm up learning rate scheduler
+ dict(
+ type='LinearLR',
+ start_factor=0.0001,
+ by_epoch=True,
+ begin=0,
+ end=20,
+ convert_to_iter_based=True),
+ # main learning rate scheduler
+ dict(
+ type='CosineAnnealingLR',
+ T_max=280,
+ eta_min=1.0e-5,
+ by_epoch=True,
+ begin=20,
+ end=300)
+]
+train_cfg = dict(by_epoch=True, max_epochs=300)
+val_cfg = dict()
+test_cfg = dict()
+
+default_hooks = dict(
+ # only keeps the latest 2 checkpoints
+ checkpoint=dict(type='CheckpointHook', interval=1, max_keep_ckpts=2))
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR,
+# based on the actual training batch size.
+auto_scale_lr = dict(base_batch_size=2048)
diff --git a/configs/spark/benchmarks/resnet50_8xb256-coslr-300e_in1k.py b/configs/spark/benchmarks/resnet50_8xb256-coslr-300e_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..5d7527ce2a545949a6395d847631b5c4484af398
--- /dev/null
+++ b/configs/spark/benchmarks/resnet50_8xb256-coslr-300e_in1k.py
@@ -0,0 +1,107 @@
+_base_ = [
+ '../../_base_/models/resnet50.py',
+ '../../_base_/datasets/imagenet_bs256_rsb_a12.py',
+ '../../_base_/default_runtime.py'
+]
+# the modifications below are based on the ResNet Strikes Back (RSB) settings
+data_preprocessor = dict(
+ num_classes=1000,
+ # RGB format normalization parameters
+ mean=[123.675, 116.28, 103.53],
+ std=[58.395, 57.12, 57.375],
+ # convert image from BGR to RGB
+ to_rgb=True,
+)
+
+bgr_mean = data_preprocessor['mean'][::-1]
+bgr_std = data_preprocessor['std'][::-1]
+
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='RandomResizedCrop',
+ scale=224,
+ backend='pillow',
+ interpolation='bicubic'),
+ dict(type='RandomFlip', prob=0.5, direction='horizontal'),
+ dict(type='NumpyToPIL', to_rgb=True),
+ dict(
+ type='torchvision/TrivialAugmentWide',
+ num_magnitude_bins=31,
+ interpolation='bicubic',
+ fill=None),
+ dict(type='PILToNumpy', to_bgr=True),
+ dict(
+ type='RandomErasing',
+ erase_prob=0.25,
+ mode='rand',
+ min_area_ratio=0.02,
+ max_area_ratio=1 / 3,
+ fill_color=bgr_mean,
+ fill_std=bgr_std),
+ dict(type='PackInputs'),
+]
+train_dataloader = dict(dataset=dict(pipeline=train_pipeline))
+
+# model settings
+model = dict(
+ backbone=dict(
+ norm_cfg=dict(type='SyncBN', requires_grad=True),
+ drop_path_rate=0.05,
+ init_cfg=dict(type='Pretrained', checkpoint='', prefix='backbone.')),
+ head=dict(
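+        # use_sigmoid=True applies the loss in a BCE style, as in the RSB recipes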
+ loss=dict(
+ type='LabelSmoothLoss', label_smooth_val=0.1, use_sigmoid=True)),
+ train_cfg=dict(augments=[
+ dict(type='Mixup', alpha=0.1),
+ dict(type='CutMix', alpha=1.0)
+ ]))
+
+# schedule settings
+# optimizer
+optim_wrapper = dict(
+ optimizer=dict(
+ type='Lamb',
+ lr=0.016,
+ weight_decay=0.02,
+ ),
+ constructor='LearningRateDecayOptimWrapperConstructor',
+ paramwise_cfg=dict(
+ layer_decay_rate=0.7,
+ norm_decay_mult=0.0,
+ bias_decay_mult=0.0,
+ flat_decay_mult=0.0))
+
+# learning policy
+param_scheduler = [
+ # warm up learning rate scheduler
+ dict(
+ type='LinearLR',
+ start_factor=0.0001,
+ by_epoch=True,
+ begin=0,
+ end=5,
+ # update by iter
+ convert_to_iter_based=True),
+ # main learning rate scheduler
+ dict(
+ type='CosineAnnealingLR',
+ T_max=295,
+ eta_min=1.0e-6,
+ by_epoch=True,
+ begin=5,
+ end=300)
+]
+train_cfg = dict(by_epoch=True, max_epochs=300)
+val_cfg = dict()
+test_cfg = dict()
+
+default_hooks = dict(
+ # only keeps the latest 2 checkpoints
+ checkpoint=dict(type='CheckpointHook', interval=1, max_keep_ckpts=2))
+# randomness
+randomness = dict(seed=0, diff_rank_seed=True)
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR,
+# based on the actual training batch size.
+auto_scale_lr = dict(base_batch_size=2048)
diff --git a/configs/spark/metafile.yml b/configs/spark/metafile.yml
new file mode 100644
index 0000000000000000000000000000000000000000..81ca3a7033e7eeac1ef88a852613f4866854f625
--- /dev/null
+++ b/configs/spark/metafile.yml
@@ -0,0 +1,73 @@
+Collections:
+ - Name: SparK
+ Metadata:
+ Architecture:
+ - Dense Connections
+ - GELU
+ - Layer Normalization
+ - Multi-Head Attention
+ - Scaled Dot-Product Attention
+ Paper:
+ Title: 'Designing BERT for Convolutional Networks: Sparse and Hierarchical Masked Modeling'
+ URL: https://arxiv.org/abs/2301.03580
+ README: configs/spark/README.md
+ Code:
+ URL: null
+ Version: null
+
+Models:
+ - Name: spark_sparse-resnet50_800e_in1k
+ Metadata:
+ FLOPs: 4100000000
+ Parameters: 37971000
+ Training Data:
+ - ImageNet-1k
+ In Collection: SparK
+ Results: null
+ Weights: https://download.openmmlab.com/mmpretrain/v1.0/spark/spark_sparse-resnet50_8xb512-amp-coslr-800e_in1k/spark_sparse-resnet50_8xb512-amp-coslr-800e_in1k_20230612-e403c28f.pth
+ Config: configs/spark/spark_sparse-resnet50_8xb512-amp-coslr-800e_in1k.py
+ Downstream:
+ - resnet50_spark-pre_300e_in1k
+ - Name: resnet50_spark-pre_300e_in1k
+ Metadata:
+ FLOPs: 1310000000
+ Parameters: 23520000
+ Training Data:
+ - ImageNet-1k
+ In Collection: SparK
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 80.1
+ Top 5 Accuracy: 94.9
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmpretrain/v1.0/spark/spark_sparse-resnet50_8xb512-amp-coslr-800e_in1k/resnet50_8xb256-coslr-300e_in1k/resnet50_8xb256-coslr-300e_in1k_20230612-f86aab51.pth
+ Config: configs/spark/benchmarks/resnet50_8xb256-coslr-300e_in1k.py
+
+ - Name: spark_sparse-convnextv2-tiny_800e_in1k
+ Metadata:
+ FLOPs: 4470000000
+ Parameters: 39732000
+ Training Data:
+ - ImageNet-1k
+ In Collection: SparK
+ Results: null
+ Weights: https://download.openmmlab.com/mmpretrain/v1.0/spark/spark_sparse-convnextv2-tiny_16xb256-amp-coslr-800e_in1k/spark_sparse-convnextv2-tiny_16xb256-amp-coslr-800e_in1k_20230612-b0ea712e.pth
+ Config: configs/spark/spark_sparse-convnextv2-tiny_16xb256-amp-coslr-800e_in1k.py
+ Downstream:
+ - convnextv2-tiny_spark-pre_300e_in1k
+ - Name: convnextv2-tiny_spark-pre_300e_in1k
+ Metadata:
+ FLOPs: 4469631744
+ Parameters: 28635496
+ Training Data:
+ - ImageNet-1k
+ In Collection: SparK
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 82.8
+ Top 5 Accuracy: 96.3
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmpretrain/v1.0/spark//spark_sparse-convnextv2-tiny_16xb256-amp-coslr-800e_in1k/convnextv2-tiny_8xb256-coslr-300e_in1k/convnextv2-tiny_8xb256-coslr-300e_in1k_20230612-ffc78743.pth
+ Config: configs/spark/benchmarks/convnextv2-tiny_8xb256-coslr-300e_in1k.py
diff --git a/configs/spark/spark_sparse-convnext-small_16xb256-amp-coslr-800e_in1k.py b/configs/spark/spark_sparse-convnext-small_16xb256-amp-coslr-800e_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..5cefb5b93ae8bd79e501b2c6ab6b874c11751b44
--- /dev/null
+++ b/configs/spark/spark_sparse-convnext-small_16xb256-amp-coslr-800e_in1k.py
@@ -0,0 +1,81 @@
+_base_ = [
+ '../_base_/datasets/imagenet_bs512_mae.py',
+ '../_base_/default_runtime.py',
+]
+
+# dataset 16 x 256
+train_dataloader = dict(batch_size=256, num_workers=8)
+
+# model settings
+model = dict(
+ type='SparK',
+ input_size=224,
+ downsample_raito=32,
+ mask_ratio=0.6,
+ enc_dec_norm_cfg=dict(type='SparseLN2d', eps=1e-6),
+ enc_dec_norm_dim=768,
+ backbone=dict(
+ type='SparseConvNeXt',
+ arch='small',
+ drop_path_rate=0.2,
+ out_indices=(0, 1, 2, 3),
+ gap_before_output=False),
+ neck=dict(
+ type='SparKLightDecoder',
+ feature_dim=512,
+ upsample_ratio=32, # equal to downsample_raito
+ mid_channels=0,
+ last_act=False),
+ head=dict(
+ type='SparKPretrainHead',
+ loss=dict(type='PixelReconstructionLoss', criterion='L2')))
+
+# optimizer wrapper
+optimizer = dict(
+ type='Lamb', lr=2e-4 * 4096 / 512, betas=(0.9, 0.95), weight_decay=0.04)
+optim_wrapper = dict(
+ type='AmpOptimWrapper',
+ optimizer=optimizer,
+ clip_grad=dict(max_norm=5.0),
+ paramwise_cfg=dict(
+ bias_decay_mult=0.0,
+ flat_decay_mult=0.0,
+ custom_keys={
+ 'mask_token': dict(decay_mult=0.),
+ }))
+
+# learning rate and weight decay schedulers
+param_scheduler = [
+ dict(
+ type='LinearLR',
+ start_factor=1e-4,
+ by_epoch=True,
+ begin=0,
+ end=40,
+ convert_to_iter_based=True),
+ dict(
+ type='CosineAnnealingLR',
+ T_max=760,
+ by_epoch=True,
+ begin=40,
+ end=800,
+ convert_to_iter_based=True),
+ dict(
+ type='CosineAnnealingWeightDecay',
+ eta_min=0.2,
+ T_max=800,
+ by_epoch=True,
+ begin=0,
+ end=800,
+ convert_to_iter_based=True)
+]
+
+# runtime settings
+train_cfg = dict(type='EpochBasedTrainLoop', max_epochs=800)
+default_hooks = dict(
+ logger=dict(type='LoggerHook', interval=100),
+    # only keeps the latest 2 checkpoints
+ checkpoint=dict(type='CheckpointHook', interval=1, max_keep_ckpts=2))
+
+# randomness
+randomness = dict(seed=0, diff_rank_seed=True)
diff --git a/configs/spark/spark_sparse-convnextv2-tiny_16xb256-amp-coslr-800e_in1k.py b/configs/spark/spark_sparse-convnextv2-tiny_16xb256-amp-coslr-800e_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..3a1afc80821abb06fcafe956d1e3c3b919ab0f20
--- /dev/null
+++ b/configs/spark/spark_sparse-convnextv2-tiny_16xb256-amp-coslr-800e_in1k.py
@@ -0,0 +1,84 @@
+_base_ = [
+ '../_base_/datasets/imagenet_bs512_mae.py',
+ '../_base_/default_runtime.py',
+]
+
+# dataset 16 x 256
+train_dataloader = dict(batch_size=256, num_workers=8)
+
+# model settings, use ConvNeXt V2
+model = dict(
+ type='SparK',
+ input_size=224,
+ downsample_raito=32,
+ mask_ratio=0.6,
+ enc_dec_norm_cfg=dict(type='SparseLN2d', eps=1e-6),
+ enc_dec_norm_dim=768,
+ backbone=dict(
+ type='SparseConvNeXt',
+ arch='tiny',
+ drop_path_rate=0.2,
+ out_indices=(0, 1, 2, 3),
+ gap_before_output=False,
+ layer_scale_init_value=0.,
+ use_grn=True,
+ ),
+ neck=dict(
+ type='SparKLightDecoder',
+ feature_dim=512,
+ upsample_ratio=32, # equal to downsample_raito
+ mid_channels=0,
+ last_act=False),
+ head=dict(
+ type='SparKPretrainHead',
+ loss=dict(type='PixelReconstructionLoss', criterion='L2')))
+
+# optimizer wrapper
+optimizer = dict(
+ type='Lamb', lr=2e-4 * 4096 / 512, betas=(0.9, 0.95), weight_decay=0.04)
+optim_wrapper = dict(
+ type='AmpOptimWrapper',
+ optimizer=optimizer,
+ clip_grad=dict(max_norm=5.0),
+ paramwise_cfg=dict(
+ bias_decay_mult=0.0,
+ flat_decay_mult=0.0,
+ custom_keys={
+ 'mask_token': dict(decay_mult=0.),
+ }))
+
+# learning rate and weight decay schedulers
+param_scheduler = [
+ dict(
+ type='LinearLR',
+ start_factor=1e-4,
+ by_epoch=True,
+ begin=0,
+ end=20,
+ convert_to_iter_based=True),
+ dict(
+ type='CosineAnnealingLR',
+ T_max=780,
+ by_epoch=True,
+ begin=20,
+ end=800,
+ convert_to_iter_based=True),
+ dict(
+ type='CosineAnnealingWeightDecay',
+ eta_min=0.2,
+ T_max=800,
+ by_epoch=True,
+ begin=0,
+ end=800,
+ convert_to_iter_based=True)
+]
+
+# runtime settings
+train_cfg = dict(type='EpochBasedTrainLoop', max_epochs=800)
+default_hooks = dict(
+ logger=dict(type='LoggerHook', interval=100),
+    # only keeps the latest 2 checkpoints
+ checkpoint=dict(type='CheckpointHook', interval=1, max_keep_ckpts=2))
+
+# randomness
+randomness = dict(seed=0, diff_rank_seed=True)
diff --git a/configs/spark/spark_sparse-resnet50_8xb512-amp-coslr-1600e_in1k.py b/configs/spark/spark_sparse-resnet50_8xb512-amp-coslr-1600e_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..10fc67574b705d2181f74db3d9d839a1812731e1
--- /dev/null
+++ b/configs/spark/spark_sparse-resnet50_8xb512-amp-coslr-1600e_in1k.py
@@ -0,0 +1,30 @@
+_base_ = 'spark_sparse-resnet50_8xb512-amp-coslr-800e_in1k.py'
+
+# learning rate and weight decay schedulers
+param_scheduler = [
+ dict(
+ type='LinearLR',
+ start_factor=1e-4,
+ by_epoch=True,
+ begin=0,
+ end=40,
+ convert_to_iter_based=True),
+ dict(
+ type='CosineAnnealingLR',
+ T_max=1560,
+ by_epoch=True,
+ begin=40,
+ end=1600,
+ convert_to_iter_based=True),
+ dict(
+ type='CosineAnnealingWeightDecay',
+ eta_min=0.2,
+ T_max=1600,
+ by_epoch=True,
+ begin=0,
+ end=1600,
+ convert_to_iter_based=True)
+]
+
+# runtime settings
+train_cfg = dict(max_epochs=1600)
diff --git a/configs/spark/spark_sparse-resnet50_8xb512-amp-coslr-800e_in1k.py b/configs/spark/spark_sparse-resnet50_8xb512-amp-coslr-800e_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..864f616209361ba63158f64d66ffb06c2693e9e8
--- /dev/null
+++ b/configs/spark/spark_sparse-resnet50_8xb512-amp-coslr-800e_in1k.py
@@ -0,0 +1,80 @@
+_base_ = [
+ '../_base_/datasets/imagenet_bs512_mae.py',
+ '../_base_/default_runtime.py',
+]
+
+# dataset 8 x 512
+train_dataloader = dict(batch_size=512, num_workers=8)
+
+# model settings
+model = dict(
+ type='SparK',
+ input_size=224,
+ downsample_raito=32,
+ mask_ratio=0.6,
+ enc_dec_norm_cfg=dict(type='SparseSyncBatchNorm2d'),
+ enc_dec_norm_dim=2048,
+ backbone=dict(
+ type='SparseResNet',
+ depth=50,
+ out_indices=(0, 1, 2, 3),
+ drop_path_rate=0.05),
+ neck=dict(
+ type='SparKLightDecoder',
+ feature_dim=512,
+ upsample_ratio=32, # equal to downsample_raito
+ mid_channels=0,
+ last_act=False),
+ head=dict(
+ type='SparKPretrainHead',
+ loss=dict(type='PixelReconstructionLoss', criterion='L2')))
+
+# optimizer wrapper
+optimizer = dict(
+ type='Lamb', lr=2e-4 * 4096 / 512, betas=(0.9, 0.95), weight_decay=0.04)
+optim_wrapper = dict(
+ type='AmpOptimWrapper',
+ optimizer=optimizer,
+ clip_grad=dict(max_norm=5.0),
+ paramwise_cfg=dict(
+ bias_decay_mult=0.0,
+ flat_decay_mult=0.0,
+ custom_keys={
+ 'mask_token': dict(decay_mult=0.),
+ }))
+
+# learning rate and weight decay schedulers
+param_scheduler = [
+ dict(
+ type='LinearLR',
+ start_factor=1e-4,
+ by_epoch=True,
+ begin=0,
+ end=40,
+ convert_to_iter_based=True),
+ dict(
+ type='CosineAnnealingLR',
+ T_max=760,
+ by_epoch=True,
+ begin=40,
+ end=800,
+ convert_to_iter_based=True),
+ dict(
+ type='CosineAnnealingWeightDecay',
+ eta_min=0.2,
+ T_max=800,
+ by_epoch=True,
+ begin=0,
+ end=800,
+ convert_to_iter_based=True)
+]
+
+# runtime settings
+train_cfg = dict(type='EpochBasedTrainLoop', max_epochs=800)
+default_hooks = dict(
+ logger=dict(type='LoggerHook', interval=100),
+    # only keeps the latest 2 checkpoints
+ checkpoint=dict(type='CheckpointHook', interval=1, max_keep_ckpts=2))
+
+# randomness
+randomness = dict(seed=0, diff_rank_seed=True)
diff --git a/configs/swav/README.md b/configs/swav/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..fdcdfeb25e3c454d084bbf2d8a7b3d685c35c9fc
--- /dev/null
+++ b/configs/swav/README.md
@@ -0,0 +1,85 @@
+# SwAV
+
+> [Unsupervised Learning of Visual Features by Contrasting Cluster Assignments](https://arxiv.org/abs/2006.09882)
+
+
+
+## Abstract
+
+Unsupervised image representations have significantly reduced the gap with supervised pretraining, notably with the recent achievements of contrastive learning methods. These contrastive methods typically work online and rely on a large number of explicit pairwise feature comparisons, which is computationally challenging. In this paper, we propose an online algorithm, SwAV, that takes advantage of contrastive methods without requiring to compute pairwise comparisons. Specifically, our method simultaneously clusters the data while enforcing consistency between cluster assignments produced for different augmentations (or “views”) of the same image, instead of comparing features directly as in contrastive learning. Simply put, we use a “swapped” prediction mechanism where we predict the code of a view from the representation of another view. Our method can be trained with large and small batches and can scale to unlimited amounts of data. Compared to previous contrastive methods, our method is more memory efficient since it does not require a large memory bank or a special momentum network. In addition, we also propose a new data augmentation strategy, multi-crop, that uses a mix of views with different resolutions in place of two full-resolution views, without increasing the memory or compute requirements.
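+
+The swapped prediction mechanism can be summarized with a short, hypothetical PyTorch sketch (this is not the `SwAVLoss` implementation in this repo; the Sinkhorn-Knopp step that normally produces the codes is replaced by placeholder soft assignments, and the temperature follows the config below):
+
+```python
+import torch
+import torch.nn.functional as F
+
+
+def swapped_prediction_loss(scores_1, scores_2, codes_1, codes_2, temperature=0.1):
+    """The code of one view supervises the prototype scores of the other view."""
+    log_p1 = F.log_softmax(scores_1 / temperature, dim=1)
+    log_p2 = F.log_softmax(scores_2 / temperature, dim=1)
+    return -0.5 * ((codes_2 * log_p1).sum(dim=1).mean() +
+                   (codes_1 * log_p2).sum(dim=1).mean())
+
+
+# toy tensors: 4 samples, 10 prototypes; real codes come from Sinkhorn-Knopp
+scores_1, scores_2 = torch.randn(4, 10), torch.randn(4, 10)
+codes_1 = F.softmax(torch.randn(4, 10), dim=1)
+codes_2 = F.softmax(torch.randn(4, 10), dim=1)
+print(swapped_prediction_loss(scores_1, scores_2, codes_1, codes_2))
+```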
+
+
+

+
+
+## How to use it?
+
+
+
+**Predict image**
+
+```python
+from mmpretrain import inference_model
+
+predict = inference_model('resnet50_swav-pre_8xb32-linear-coslr-100e_in1k', 'demo/bird.JPEG')
+print(predict['pred_class'])
+print(predict['pred_score'])
+```
+
+**Use the model**
+
+```python
+import torch
+from mmpretrain import get_model
+
+model = get_model('swav_resnet50_8xb32-mcrop-coslr-200e_in1k-224px-96px', pretrained=True)
+inputs = torch.rand(1, 3, 224, 224)
+out = model(inputs)
+print(type(out))
+# To extract features.
+feats = model.extract_feat(inputs)
+print(type(feats))
+```
+
+**Train/Test Command**
+
+Prepare your dataset according to the [docs](https://mmpretrain.readthedocs.io/en/latest/user_guides/dataset_prepare.html#prepare-dataset).
+
+Train:
+
+```shell
+python tools/train.py configs/swav/swav_resnet50_8xb32-mcrop-coslr-200e_in1k-224px-96px.py
+```
+
+Test:
+
+```shell
+python tools/test.py configs/swav/benchmarks/resnet50_8xb512-linear-coslr-90e_in1k.py https://download.openmmlab.com/mmselfsup/1.x/swav/swav_resnet50_8xb32-mcrop-2-6-coslr-200e_in1k-224-96/resnet50_linear-8xb32-coslr-100e_in1k/resnet50_linear-8xb32-coslr-100e_in1k_20220825-80341e08.pth
+```
+
+
+
+## Models and results
+
+### Pretrained models
+
+| Model | Params (M) | Flops (G) | Config | Download |
+| :----------------------------------------------------- | :--------: | :-------: | :------------------------------------------------------------: | :---------------------------------------------------------------: |
+| `swav_resnet50_8xb32-mcrop-coslr-200e_in1k-224px-96px` | 28.35 | 4.11 | [config](swav_resnet50_8xb32-mcrop-coslr-200e_in1k-224px-96px.py) | [model](https://download.openmmlab.com/mmselfsup/1.x/swav/swav_resnet50_8xb32-mcrop-2-6-coslr-200e_in1k-224-96/swav_resnet50_8xb32-mcrop-2-6-coslr-200e_in1k-224-96_20220825-5b3fc7fc.pth) \| [log](https://download.openmmlab.com/mmselfsup/1.x/swav/swav_resnet50_8xb32-mcrop-2-6-coslr-200e_in1k-224-96/swav_resnet50_8xb32-mcrop-2-6-coslr-200e_in1k-224-96_20220825-5b3fc7fc.json) |
+
+### Image Classification on ImageNet-1k
+
+| Model | Pretrain | Params (M) | Flops (G) | Top-1 (%) | Config | Download |
+| :---------------------------------------- | :------------------------------------------: | :--------: | :-------: | :-------: | :----------------------------------------: | :-------------------------------------------: |
+| `resnet50_swav-pre_8xb32-linear-coslr-100e_in1k` | [SWAV](https://download.openmmlab.com/mmselfsup/1.x/swav/swav_resnet50_8xb32-mcrop-2-6-coslr-200e_in1k-224-96/swav_resnet50_8xb32-mcrop-2-6-coslr-200e_in1k-224-96_20220825-5b3fc7fc.pth) | 25.56 | 4.11 | 70.50 | [config](benchmarks/resnet50_8xb512-linear-coslr-90e_in1k.py) | [model](https://download.openmmlab.com/mmselfsup/1.x/swav/swav_resnet50_8xb32-mcrop-2-6-coslr-200e_in1k-224-96/resnet50_linear-8xb32-coslr-100e_in1k/resnet50_linear-8xb32-coslr-100e_in1k_20220825-80341e08.pth) \| [log](https://download.openmmlab.com/mmselfsup/1.x/swav/swav_resnet50_8xb32-mcrop-2-6-coslr-200e_in1k-224-96/resnet50_linear-8xb32-coslr-100e_in1k/resnet50_linear-8xb32-coslr-100e_in1k_20220825-80341e08.json) |
+
+## Citation
+
+```bibtex
+@inproceedings{caron2020unsupervised,
+ title={Unsupervised Learning of Visual Features by Contrasting Cluster Assignments},
+ author={Caron, Mathilde and Misra, Ishan and Mairal, Julien and Goyal, Priya and Bojanowski, Piotr and Joulin, Armand},
+ booktitle={NeurIPS},
+ year={2020}
+}
+```
diff --git a/configs/swav/benchmarks/resnet50_8xb512-linear-coslr-90e_in1k.py b/configs/swav/benchmarks/resnet50_8xb512-linear-coslr-90e_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..2b5074c082b8b6fb36bd3c6711b60bab6394b4ce
--- /dev/null
+++ b/configs/swav/benchmarks/resnet50_8xb512-linear-coslr-90e_in1k.py
@@ -0,0 +1,18 @@
+_base_ = [
+ '../../_base_/models/resnet50.py',
+ '../../_base_/datasets/imagenet_bs32_pil_resize.py',
+ '../../_base_/schedules/imagenet_lars_coslr_90e.py',
+ '../../_base_/default_runtime.py',
+]
+
+model = dict(
+ backbone=dict(
+ frozen_stages=4,
+ init_cfg=dict(type='Pretrained', checkpoint='', prefix='backbone.')))
+
+# dataset summary
+train_dataloader = dict(batch_size=512)
+
+# runtime settings
+default_hooks = dict(
+ checkpoint=dict(type='CheckpointHook', interval=10, max_keep_ckpts=3))
diff --git a/configs/swav/metafile.yml b/configs/swav/metafile.yml
new file mode 100644
index 0000000000000000000000000000000000000000..5bc1252ad1ed6528d28847b728b85f3e91e7d0b9
--- /dev/null
+++ b/configs/swav/metafile.yml
@@ -0,0 +1,44 @@
+Collections:
+ - Name: SwAV
+ Metadata:
+ Training Data: ImageNet-1k
+ Training Techniques:
+ - LARS
+ Training Resources: 8x V100 GPUs
+ Architecture:
+ - ResNet
+ - SwAV
+ Paper:
+ Title: Unsupervised Learning of Visual Features by Contrasting Cluster Assignments
+ URL: https://arxiv.org/abs/2006.09882
+ README: configs/swav/README.md
+
+Models:
+ - Name: swav_resnet50_8xb32-mcrop-coslr-200e_in1k-224px-96px
+ Metadata:
+ Epochs: 200
+ Batch Size: 256
+ FLOPs: 4109364224
+ Parameters: 28354752
+ Training Data: ImageNet-1k
+ In Collection: SwAV
+ Results: null
+ Weights: https://download.openmmlab.com/mmselfsup/1.x/swav/swav_resnet50_8xb32-mcrop-2-6-coslr-200e_in1k-224-96/swav_resnet50_8xb32-mcrop-2-6-coslr-200e_in1k-224-96_20220825-5b3fc7fc.pth
+ Config: configs/swav/swav_resnet50_8xb32-mcrop-coslr-200e_in1k-224px-96px.py
+ Downstream:
+ - resnet50_swav-pre_8xb32-linear-coslr-100e_in1k
+ - Name: resnet50_swav-pre_8xb32-linear-coslr-100e_in1k
+ Metadata:
+ Epochs: 100
+ Batch Size: 256
+ FLOPs: 4109464576
+ Parameters: 25557032
+ Training Data: ImageNet-1k
+ In Collection: SwAV
+ Results:
+ - Task: Image Classification
+ Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 70.5
+ Weights: https://download.openmmlab.com/mmselfsup/1.x/swav/swav_resnet50_8xb32-mcrop-2-6-coslr-200e_in1k-224-96/resnet50_linear-8xb32-coslr-100e_in1k/resnet50_linear-8xb32-coslr-100e_in1k_20220825-80341e08.pth
+ Config: configs/swav/benchmarks/resnet50_8xb512-linear-coslr-90e_in1k.py
diff --git a/configs/swav/swav_resnet50_8xb32-mcrop-coslr-200e_in1k-224px-96px.py b/configs/swav/swav_resnet50_8xb32-mcrop-coslr-200e_in1k-224px-96px.py
new file mode 100644
index 0000000000000000000000000000000000000000..ebb9ead92ef84387aa8715c013be36eebb661dd8
--- /dev/null
+++ b/configs/swav/swav_resnet50_8xb32-mcrop-coslr-200e_in1k-224px-96px.py
@@ -0,0 +1,159 @@
+_base_ = [
+ '../_base_/schedules/imagenet_lars_coslr_200e.py',
+ '../_base_/default_runtime.py',
+]
+
+# dataset settings
+dataset_type = 'ImageNet'
+data_root = 'data/imagenet/'
+data_preprocessor = dict(
+ type='SelfSupDataPreprocessor',
+ mean=[123.675, 116.28, 103.53],
+ std=[58.395, 57.12, 57.375],
+ to_rgb=True)
+
+num_crops = [2, 6]
+color_distort_strength = 1.0
+view_pipeline1 = [
+ dict(
+ type='RandomResizedCrop',
+ scale=224,
+ crop_ratio_range=(0.14, 1.),
+ backend='pillow'),
+ dict(
+ type='RandomApply',
+ transforms=[
+ dict(
+ type='ColorJitter',
+ brightness=0.8 * color_distort_strength,
+ contrast=0.8 * color_distort_strength,
+ saturation=0.8 * color_distort_strength,
+ hue=0.2 * color_distort_strength)
+ ],
+ prob=0.8),
+ dict(
+ type='RandomGrayscale',
+ prob=0.2,
+ keep_channels=True,
+ channel_weights=(0.114, 0.587, 0.2989)),
+ dict(
+ type='GaussianBlur',
+ magnitude_range=(0.1, 2.0),
+ magnitude_std='inf',
+ prob=0.5),
+ dict(type='RandomFlip', prob=0.5),
+]
+view_pipeline2 = [
+ dict(
+ type='RandomResizedCrop',
+ scale=96,
+ crop_ratio_range=(0.05, 0.14),
+ backend='pillow'),
+ dict(
+ type='RandomApply',
+ transforms=[
+ dict(
+ type='ColorJitter',
+ brightness=0.8 * color_distort_strength,
+ contrast=0.8 * color_distort_strength,
+ saturation=0.8 * color_distort_strength,
+ hue=0.2 * color_distort_strength)
+ ],
+ prob=0.8),
+ dict(
+ type='RandomGrayscale',
+ prob=0.2,
+ keep_channels=True,
+ channel_weights=(0.114, 0.587, 0.2989)),
+ dict(
+ type='GaussianBlur',
+ magnitude_range=(0.1, 2.0),
+ magnitude_std='inf',
+ prob=0.5),
+ dict(type='RandomFlip', prob=0.5),
+]
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='MultiView',
+ num_views=num_crops,
+ transforms=[view_pipeline1, view_pipeline2]),
+ dict(type='PackInputs')
+]
+
+batch_size = 32
+train_dataloader = dict(
+ batch_size=batch_size,
+ num_workers=8,
+ drop_last=True,
+ persistent_workers=True,
+ sampler=dict(type='DefaultSampler', shuffle=True),
+ collate_fn=dict(type='default_collate'),
+ dataset=dict(
+ type=dataset_type,
+ data_root=data_root,
+ ann_file='meta/train.txt',
+ data_prefix=dict(img_path='train/'),
+ pipeline=train_pipeline))
+
+# model settings
+model = dict(
+ type='SwAV',
+ data_preprocessor=dict(
+ mean=(123.675, 116.28, 103.53),
+ std=(58.395, 57.12, 57.375),
+ to_rgb=True),
+ backbone=dict(
+ type='ResNet',
+ depth=50,
+ norm_cfg=dict(type='SyncBN'),
+ zero_init_residual=True),
+ neck=dict(
+ type='SwAVNeck',
+ in_channels=2048,
+ hid_channels=2048,
+ out_channels=128,
+ with_avg_pool=True),
+ head=dict(
+ type='SwAVHead',
+ loss=dict(
+ type='SwAVLoss',
+ feat_dim=128, # equal to neck['out_channels']
+ epsilon=0.05,
+ temperature=0.1,
+ num_crops=num_crops,
+ )))
+
+# optimizer
+optim_wrapper = dict(type='OptimWrapper', optimizer=dict(type='LARS', lr=0.6))
+find_unused_parameters = True
+
+# learning policy
+param_scheduler = [
+ dict(
+ type='CosineAnnealingLR',
+ T_max=200,
+ eta_min=6e-4,
+ by_epoch=True,
+ begin=0,
+ end=200,
+ convert_to_iter_based=True)
+]
+
+# runtime settings
+default_hooks = dict(
+ # only keeps the latest 3 checkpoints
+ checkpoint=dict(type='CheckpointHook', interval=10, max_keep_ckpts=3))
+
+# additional hooks
+custom_hooks = [
+ dict(
+ type='SwAVHook',
+ priority='VERY_HIGH',
+ batch_size=batch_size,
+ epoch_queue_starts=15,
+ crops_for_assign=[0, 1],
+ feat_dim=128,
+ queue_length=3840,
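+        # keep the prototype layer frozen for the first 5005 iterations,
+        # roughly one ImageNet epoch at the total batch size of 256 (8 x 32)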
+ frozen_layers_cfg=dict(prototypes=5005))
+]
diff --git a/configs/swin_transformer/README.md b/configs/swin_transformer/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..1d41f13a52554d7dd5896d284cd22b47b6b1fc8a
--- /dev/null
+++ b/configs/swin_transformer/README.md
@@ -0,0 +1,111 @@
+# Swin-Transformer
+
+> [Swin Transformer: Hierarchical Vision Transformer using Shifted Windows](https://arxiv.org/abs/2103.14030)
+
+
+
+## Introduction
+
+**Swin Transformer** (the name **Swin** stands for **S**hifted **win**dow) was first described in [the paper](https://arxiv.org/pdf/2103.14030.pdf) and capably serves as a general-purpose backbone for computer vision. It is essentially a hierarchical Transformer whose representation is computed with shifted windows. The shifted windowing scheme brings greater efficiency by limiting self-attention computation to non-overlapping local windows while still allowing for cross-window connections.
+
+Swin Transformer achieves strong performance on COCO object detection (58.7 box AP and 51.1 mask AP on test-dev) and ADE20K semantic segmentation (53.5 mIoU on val), surpassing previous models by a large margin.
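+
+The window partitioning behind this scheme can be sketched in a few lines of PyTorch (a simplified illustration, not the backbone code in this repo; the tensor shape assumes the Swin-T stage-1 setting of a 56x56 feature map with 96 channels and window size 7):
+
+```python
+import torch
+
+
+def window_partition(x, window_size):
+    """Split a feature map of shape (B, H, W, C) into non-overlapping windows."""
+    B, H, W, C = x.shape
+    x = x.view(B, H // window_size, window_size, W // window_size, window_size, C)
+    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, window_size, window_size, C)
+
+
+# Shifted windows: roll the feature map by half a window before partitioning,
+# so the next attention layer mixes information across window borders.
+feat = torch.rand(1, 56, 56, 96)
+shifted = torch.roll(feat, shifts=(-3, -3), dims=(1, 2))  # shift = window_size // 2
+windows = window_partition(shifted, window_size=7)
+print(windows.shape)  # torch.Size([64, 7, 7, 96]): 64 local windows for self-attention
+```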
+
+
+

+
+
+## Abstract
+
+
+
+Show the paper's abstract
+
+
+This paper presents a new vision Transformer, called Swin Transformer, that capably serves as a general-purpose backbone for computer vision. Challenges in adapting Transformer from language to vision arise from differences between the two domains, such as large variations in the scale of visual entities and the high resolution of pixels in images compared to words in text. To address these differences, we propose a hierarchical Transformer whose representation is computed with **Shifted windows**. The shifted windowing scheme brings greater efficiency by limiting self-attention computation to non-overlapping local windows while also allowing for cross-window connection. This hierarchical architecture has the flexibility to model at various scales and has linear computational complexity with respect to image size. These qualities of Swin Transformer make it compatible with a broad range of vision tasks, including image classification (87.3 top-1 accuracy on ImageNet-1K) and dense prediction tasks such as object detection (58.7 box AP and 51.1 mask AP on COCO test-dev) and semantic segmentation (53.5 mIoU on ADE20K val). Its performance surpasses the previous state-of-the-art by a large margin of +2.7 box AP and +2.6 mask AP on COCO, and +3.2 mIoU on ADE20K, demonstrating the potential of Transformer-based models as vision backbones. The hierarchical design and the shifted window approach also prove beneficial for all-MLP architectures.
+
+
+
+
+## How to use it?
+
+
+
+**Predict image**
+
+```python
+from mmpretrain import inference_model
+
+predict = inference_model('swin-tiny_16xb64_in1k', 'demo/bird.JPEG')
+print(predict['pred_class'])
+print(predict['pred_score'])
+```
+
+**Use the model**
+
+```python
+import torch
+from mmpretrain import get_model
+
+model = get_model('swin-tiny_16xb64_in1k', pretrained=True)
+inputs = torch.rand(1, 3, 224, 224)
+out = model(inputs)
+print(type(out))
+# To extract features.
+feats = model.extract_feat(inputs)
+print(type(feats))
+```
+
+**Train/Test Command**
+
+Prepare your dataset according to the [docs](https://mmpretrain.readthedocs.io/en/latest/user_guides/dataset_prepare.html#prepare-dataset).
+
+Train:
+
+```shell
+python tools/train.py configs/swin_transformer/swin-tiny_16xb64_in1k.py
+```
+
+Test:
+
+```shell
+python tools/test.py configs/swin_transformer/swin-tiny_16xb64_in1k.py https://download.openmmlab.com/mmclassification/v0/swin-transformer/swin_tiny_224_b16x64_300e_imagenet_20210616_090925-66df6be6.pth
+```
+
+
+
+## Models and results
+
+### Image Classification on ImageNet-1k
+
+| Model | Pretrain | Params (M) | Flops (G) | Top-1 (%) | Top-5 (%) | Config | Download |
+| :----------------------------------------- | :----------: | :--------: | :-------: | :-------: | :-------: | :---------------------------------------: | :------------------------------------------------------------------: |
+| `swin-tiny_16xb64_in1k` | From scratch | 28.29 | 4.36 | 81.18 | 95.61 | [config](swin-tiny_16xb64_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/swin-transformer/swin_tiny_224_b16x64_300e_imagenet_20210616_090925-66df6be6.pth) \| [log](https://download.openmmlab.com/mmclassification/v0/swin-transformer/swin_tiny_224_b16x64_300e_imagenet_20210616_090925.json) |
+| `swin-small_16xb64_in1k` | From scratch | 49.61 | 8.52 | 83.02 | 96.29 | [config](swin-small_16xb64_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/swin-transformer/swin_small_224_b16x64_300e_imagenet_20210615_110219-7f9d988b.pth) \| [log](https://download.openmmlab.com/mmclassification/v0/swin-transformer/swin_small_224_b16x64_300e_imagenet_20210615_110219.json) |
+| `swin-base_16xb64_in1k` | From scratch | 87.77 | 15.14 | 83.36 | 96.44 | [config](swin-base_16xb64_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/swin-transformer/swin_base_224_b16x64_300e_imagenet_20210616_190742-93230b0d.pth) \| [log](https://download.openmmlab.com/mmclassification/v0/swin-transformer/swin_base_224_b16x64_300e_imagenet_20210616_190742.json) |
+| `swin-tiny_3rdparty_in1k`\* | From scratch | 28.29 | 4.36 | 81.18 | 95.52 | [config](swin-tiny_16xb64_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/swin-transformer/convert/swin_tiny_patch4_window7_224-160bb0a5.pth) |
+| `swin-small_3rdparty_in1k`\* | From scratch | 49.61 | 8.52 | 83.21 | 96.25 | [config](swin-small_16xb64_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/swin-transformer/convert/swin_small_patch4_window7_224-cc7a01c9.pth) |
+| `swin-base_3rdparty_in1k`\* | From scratch | 87.77 | 15.14 | 83.42 | 96.44 | [config](swin-base_16xb64_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/swin-transformer/convert/swin_base_patch4_window7_224-4670dd19.pth) |
+| `swin-base_3rdparty_in1k-384`\* | From scratch | 87.90 | 44.49 | 84.49 | 96.95 | [config](swin-base_16xb64_in1k-384px.py) | [model](https://download.openmmlab.com/mmclassification/v0/swin-transformer/convert/swin_base_patch4_window12_384-02c598a4.pth) |
+| `swin-base_in21k-pre-3rdparty_in1k`\* | From scratch | 87.77 | 15.14 | 85.16 | 97.50 | [config](swin-base_16xb64_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/swin-transformer/convert/swin_base_patch4_window7_224_22kto1k-f967f799.pth) |
+| `swin-base_in21k-pre-3rdparty_in1k-384`\* | From scratch | 87.90 | 44.49 | 86.44 | 98.05 | [config](swin-base_16xb64_in1k-384px.py) | [model](https://download.openmmlab.com/mmclassification/v0/swin-transformer/convert/swin_base_patch4_window12_384_22kto1k-d59b0d1d.pth) |
+| `swin-large_in21k-pre-3rdparty_in1k`\* | From scratch | 196.53 | 34.04 | 86.24 | 97.88 | [config](swin-large_16xb64_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/swin-transformer/convert/swin_large_patch4_window7_224_22kto1k-5f0996db.pth) |
+| `swin-large_in21k-pre-3rdparty_in1k-384`\* | From scratch | 196.74 | 100.04 | 87.25 | 98.25 | [config](swin-large_16xb64_in1k-384px.py) | [model](https://download.openmmlab.com/mmclassification/v0/swin-transformer/convert/swin_large_patch4_window12_384_22kto1k-0a40944b.pth) |
+
+*Models with * are converted from the [official repo](https://github.com/microsoft/Swin-Transformer/blob/777f6c66604bb5579086c4447efe3620344d95a9/models/swin_transformer.py#L458). The config files of these models are only for inference. We haven't reproduced the training results.*
+
+### Image Classification on CUB-200-2011
+
+| Model | Pretrain | Params (M) | Flops (G) | Top-1 (%) | Config | Download |
+| :-------------------------- | :----------: | :--------: | :-------: | :-------: | :------------------------------------: | :---------------------------------------------------------------------------------------------: |
+| `swin-large_8xb8_cub-384px` | From scratch | 195.51 | 100.04 | 91.87 | [config](swin-large_8xb8_cub-384px.py) | [model](https://download.openmmlab.com/mmclassification/v0/swin-transformer/swin-large_8xb8_cub_384px_20220307-1bbaee6a.pth) \| [log](https://download.openmmlab.com/mmclassification/v0/swin-transformer/swin-large_8xb8_cub_384px_20220307-1bbaee6a.json) |
+
+## Citation
+
+```bibtex
+@article{liu2021Swin,
+ title={Swin Transformer: Hierarchical Vision Transformer using Shifted Windows},
+ author={Liu, Ze and Lin, Yutong and Cao, Yue and Hu, Han and Wei, Yixuan and Zhang, Zheng and Lin, Stephen and Guo, Baining},
+ journal={arXiv preprint arXiv:2103.14030},
+ year={2021}
+}
+```
diff --git a/configs/swin_transformer/metafile.yml b/configs/swin_transformer/metafile.yml
new file mode 100644
index 0000000000000000000000000000000000000000..8bff599267afe52a0904c106be4fcd8c76f6e4bf
--- /dev/null
+++ b/configs/swin_transformer/metafile.yml
@@ -0,0 +1,201 @@
+Collections:
+ - Name: Swin-Transformer
+ Metadata:
+ Training Data: ImageNet-1k
+ Training Techniques:
+ - AdamW
+ - Weight Decay
+ Training Resources: 16x V100 GPUs
+ Epochs: 300
+ Batch Size: 1024
+ Architecture:
+ - Shift Window Multihead Self Attention
+ Paper:
+ URL: https://arxiv.org/abs/2103.14030
+ Title: "Swin Transformer: Hierarchical Vision Transformer using Shifted Windows"
+ README: configs/swin_transformer/README.md
+ Code:
+ URL: https://github.com/open-mmlab/mmpretrain/blob/v0.15.0/mmcls/models/backbones/swin_transformer.py#L176
+ Version: v0.15.0
+
+Models:
+ - Name: swin-tiny_16xb64_in1k
+ Metadata:
+ FLOPs: 4360000000
+ Parameters: 28290000
+ In Collection: Swin-Transformer
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 81.18
+ Top 5 Accuracy: 95.61
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/swin-transformer/swin_tiny_224_b16x64_300e_imagenet_20210616_090925-66df6be6.pth
+ Config: configs/swin_transformer/swin-tiny_16xb64_in1k.py
+ - Name: swin-small_16xb64_in1k
+ Metadata:
+ FLOPs: 8520000000
+ Parameters: 49610000
+ In Collection: Swin-Transformer
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 83.02
+ Top 5 Accuracy: 96.29
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/swin-transformer/swin_small_224_b16x64_300e_imagenet_20210615_110219-7f9d988b.pth
+ Config: configs/swin_transformer/swin-small_16xb64_in1k.py
+ - Name: swin-base_16xb64_in1k
+ Metadata:
+ FLOPs: 15140000000
+ Parameters: 87770000
+ In Collection: Swin-Transformer
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 83.36
+ Top 5 Accuracy: 96.44
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/swin-transformer/swin_base_224_b16x64_300e_imagenet_20210616_190742-93230b0d.pth
+ Config: configs/swin_transformer/swin-base_16xb64_in1k.py
+ - Name: swin-tiny_3rdparty_in1k
+ Metadata:
+ FLOPs: 4360000000
+ Parameters: 28290000
+ In Collection: Swin-Transformer
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 81.18
+ Top 5 Accuracy: 95.52
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/swin-transformer/convert/swin_tiny_patch4_window7_224-160bb0a5.pth
+ Converted From:
+ Weights: https://github.com/SwinTransformer/storage/releases/download/v1.0.0/swin_tiny_patch4_window7_224.pth
+ Code: https://github.com/microsoft/Swin-Transformer/blob/777f6c66604bb5579086c4447efe3620344d95a9/models/swin_transformer.py#L458
+ Config: configs/swin_transformer/swin-tiny_16xb64_in1k.py
+ - Name: swin-small_3rdparty_in1k
+ Metadata:
+ FLOPs: 8520000000
+ Parameters: 49610000
+ In Collection: Swin-Transformer
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 83.21
+ Top 5 Accuracy: 96.25
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/swin-transformer/convert/swin_small_patch4_window7_224-cc7a01c9.pth
+ Converted From:
+ Weights: https://github.com/SwinTransformer/storage/releases/download/v1.0.0/swin_small_patch4_window7_224.pth
+ Code: https://github.com/microsoft/Swin-Transformer/blob/777f6c66604bb5579086c4447efe3620344d95a9/models/swin_transformer.py#L458
+ Config: configs/swin_transformer/swin-small_16xb64_in1k.py
+ - Name: swin-base_3rdparty_in1k
+ Metadata:
+ FLOPs: 15140000000
+ Parameters: 87770000
+ In Collection: Swin-Transformer
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 83.42
+ Top 5 Accuracy: 96.44
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/swin-transformer/convert/swin_base_patch4_window7_224-4670dd19.pth
+ Converted From:
+ Weights: https://github.com/SwinTransformer/storage/releases/download/v1.0.0/swin_base_patch4_window7_224.pth
+ Code: https://github.com/microsoft/Swin-Transformer/blob/777f6c66604bb5579086c4447efe3620344d95a9/models/swin_transformer.py#L458
+ Config: configs/swin_transformer/swin-base_16xb64_in1k.py
+ - Name: swin-base_3rdparty_in1k-384
+ Metadata:
+ FLOPs: 44490000000
+ Parameters: 87900000
+ In Collection: Swin-Transformer
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 84.49
+ Top 5 Accuracy: 96.95
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/swin-transformer/convert/swin_base_patch4_window12_384-02c598a4.pth
+ Converted From:
+ Weights: https://github.com/SwinTransformer/storage/releases/download/v1.0.0/swin_base_patch4_window12_384.pth
+ Code: https://github.com/microsoft/Swin-Transformer/blob/777f6c66604bb5579086c4447efe3620344d95a9/models/swin_transformer.py#L458
+ Config: configs/swin_transformer/swin-base_16xb64_in1k-384px.py
+ - Name: swin-base_in21k-pre-3rdparty_in1k
+ Metadata:
+ FLOPs: 15140000000
+ Parameters: 87770000
+ In Collection: Swin-Transformer
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 85.16
+ Top 5 Accuracy: 97.50
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/swin-transformer/convert/swin_base_patch4_window7_224_22kto1k-f967f799.pth
+ Converted From:
+ Weights: https://github.com/SwinTransformer/storage/releases/download/v1.0.0/swin_base_patch4_window7_224_22kto1k.pth
+ Code: https://github.com/microsoft/Swin-Transformer/blob/777f6c66604bb5579086c4447efe3620344d95a9/models/swin_transformer.py#L458
+ Config: configs/swin_transformer/swin-base_16xb64_in1k.py
+ - Name: swin-base_in21k-pre-3rdparty_in1k-384
+ Metadata:
+ FLOPs: 44490000000
+ Parameters: 87900000
+ In Collection: Swin-Transformer
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 86.44
+ Top 5 Accuracy: 98.05
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/swin-transformer/convert/swin_base_patch4_window12_384_22kto1k-d59b0d1d.pth
+ Converted From:
+ Weights: https://github.com/SwinTransformer/storage/releases/download/v1.0.0/swin_base_patch4_window12_384_22kto1k.pth
+ Code: https://github.com/microsoft/Swin-Transformer/blob/777f6c66604bb5579086c4447efe3620344d95a9/models/swin_transformer.py#L458
+ Config: configs/swin_transformer/swin-base_16xb64_in1k-384px.py
+ - Name: swin-large_in21k-pre-3rdparty_in1k
+ Metadata:
+ FLOPs: 34040000000
+ Parameters: 196530000
+ In Collection: Swin-Transformer
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 86.24
+ Top 5 Accuracy: 97.88
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/swin-transformer/convert/swin_large_patch4_window7_224_22kto1k-5f0996db.pth
+ Converted From:
+ Weights: https://github.com/SwinTransformer/storage/releases/download/v1.0.0/swin_large_patch4_window7_224_22kto1k.pth
+ Code: https://github.com/microsoft/Swin-Transformer/blob/777f6c66604bb5579086c4447efe3620344d95a9/models/swin_transformer.py#L458
+ Config: configs/swin_transformer/swin-large_16xb64_in1k.py
+ - Name: swin-large_in21k-pre-3rdparty_in1k-384
+ Metadata:
+ FLOPs: 100040000000
+ Parameters: 196740000
+ In Collection: Swin-Transformer
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 87.25
+ Top 5 Accuracy: 98.25
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/swin-transformer/convert/swin_large_patch4_window12_384_22kto1k-0a40944b.pth
+ Converted From:
+ Weights: https://github.com/SwinTransformer/storage/releases/download/v1.0.0/swin_large_patch4_window12_384_22kto1k.pth
+ Code: https://github.com/microsoft/Swin-Transformer/blob/777f6c66604bb5579086c4447efe3620344d95a9/models/swin_transformer.py#L458
+ Config: configs/swin_transformer/swin-large_16xb64_in1k-384px.py
+ - Name: swin-large_8xb8_cub-384px
+ Metadata:
+ FLOPs: 100040000000
+ Parameters: 195510000
+ In Collection: Swin-Transformer
+ Results:
+ - Dataset: CUB-200-2011
+ Metrics:
+ Top 1 Accuracy: 91.87
+ Task: Image Classification
+ Pretrain: https://download.openmmlab.com/mmclassification/v0/swin-transformer/convert/swin-large_3rdparty_in21k-384px.pth
+ Weights: https://download.openmmlab.com/mmclassification/v0/swin-transformer/swin-large_8xb8_cub_384px_20220307-1bbaee6a.pth
+ Config: configs/swin_transformer/swin-large_8xb8_cub-384px.py
diff --git a/configs/swin_transformer/swin-base_16xb64_in1k-384px.py b/configs/swin_transformer/swin-base_16xb64_in1k-384px.py
new file mode 100644
index 0000000000000000000000000000000000000000..10f89921ff1ec6659509ccdee8e15cfe52395880
--- /dev/null
+++ b/configs/swin_transformer/swin-base_16xb64_in1k-384px.py
@@ -0,0 +1,9 @@
+_base_ = [
+ '../_base_/models/swin_transformer/base_384.py',
+ '../_base_/datasets/imagenet_bs64_swin_384.py',
+ '../_base_/schedules/imagenet_bs1024_adamw_swin.py',
+ '../_base_/default_runtime.py'
+]
+
+# schedule settings
+optim_wrapper = dict(clip_grad=dict(max_norm=5.0))
diff --git a/configs/swin_transformer/swin-base_16xb64_in1k.py b/configs/swin_transformer/swin-base_16xb64_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..05a95b4483dd3764abbcf9e32b1291334e084099
--- /dev/null
+++ b/configs/swin_transformer/swin-base_16xb64_in1k.py
@@ -0,0 +1,9 @@
+_base_ = [
+ '../_base_/models/swin_transformer/base_224.py',
+ '../_base_/datasets/imagenet_bs64_swin_224.py',
+ '../_base_/schedules/imagenet_bs1024_adamw_swin.py',
+ '../_base_/default_runtime.py'
+]
+
+# schedule settings
+optim_wrapper = dict(clip_grad=dict(max_norm=5.0))
diff --git a/configs/swin_transformer/swin-large_16xb64_in1k-384px.py b/configs/swin_transformer/swin-large_16xb64_in1k-384px.py
new file mode 100644
index 0000000000000000000000000000000000000000..5ba52b3564704acfeb2c40eb39e1d4e5cf5bf573
--- /dev/null
+++ b/configs/swin_transformer/swin-large_16xb64_in1k-384px.py
@@ -0,0 +1,9 @@
+_base_ = [
+ '../_base_/models/swin_transformer/large_384.py',
+ '../_base_/datasets/imagenet_bs64_swin_384.py',
+ '../_base_/schedules/imagenet_bs1024_adamw_swin.py',
+ '../_base_/default_runtime.py'
+]
+
+# schedule settings
+optim_wrapper = dict(clip_grad=dict(max_norm=5.0))
diff --git a/configs/swin_transformer/swin-large_16xb64_in1k.py b/configs/swin_transformer/swin-large_16xb64_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..36121efca15f951a03d153b614d3e844cc8cad26
--- /dev/null
+++ b/configs/swin_transformer/swin-large_16xb64_in1k.py
@@ -0,0 +1,9 @@
+_base_ = [
+ '../_base_/models/swin_transformer/large_224.py',
+ '../_base_/datasets/imagenet_bs64_swin_224.py',
+ '../_base_/schedules/imagenet_bs1024_adamw_swin.py',
+ '../_base_/default_runtime.py'
+]
+
+# schedule settings
+optim_wrapper = dict(clip_grad=dict(max_norm=5.0))
diff --git a/configs/swin_transformer/swin-large_8xb8_cub-384px.py b/configs/swin_transformer/swin-large_8xb8_cub-384px.py
new file mode 100644
index 0000000000000000000000000000000000000000..a2f10a6a292bc2485085a38c895b635a5944d04c
--- /dev/null
+++ b/configs/swin_transformer/swin-large_8xb8_cub-384px.py
@@ -0,0 +1,40 @@
+_base_ = [
+ '../_base_/models/swin_transformer/large_384.py',
+ '../_base_/datasets/cub_bs8_384.py',
+ '../_base_/schedules/cub_bs64.py',
+ '../_base_/default_runtime.py',
+]
+
+# model settings
+checkpoint = 'https://download.openmmlab.com/mmclassification/v0/swin-transformer/convert/swin-large_3rdparty_in21k-384px.pth' # noqa
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ init_cfg=dict(
+ type='Pretrained', checkpoint=checkpoint, prefix='backbone')),
+ head=dict(num_classes=200, ))
+
+# schedule settings
+optim_wrapper = dict(
+ optimizer=dict(
+ _delete_=True,
+ type='AdamW',
+ lr=5e-6,
+ weight_decay=0.0005,
+ eps=1e-8,
+ betas=(0.9, 0.999)),
+ paramwise_cfg=dict(
+ norm_decay_mult=0.0,
+ bias_decay_mult=0.0,
+ custom_keys={
+ '.absolute_pos_embed': dict(decay_mult=0.0),
+ '.relative_position_bias_table': dict(decay_mult=0.0)
+ }),
+ clip_grad=dict(max_norm=5.0),
+)
+
+default_hooks = dict(
+    # log every 20 iterations
+ logger=dict(type='LoggerHook', interval=20),
+ # save last three checkpoints
+ checkpoint=dict(type='CheckpointHook', interval=1, max_keep_ckpts=3))
diff --git a/configs/swin_transformer/swin-small_16xb64_in1k.py b/configs/swin_transformer/swin-small_16xb64_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..7c1a8e21a7f2cbc881cbde43c19af9cd10b7c2ba
--- /dev/null
+++ b/configs/swin_transformer/swin-small_16xb64_in1k.py
@@ -0,0 +1,9 @@
+_base_ = [
+ '../_base_/models/swin_transformer/small_224.py',
+ '../_base_/datasets/imagenet_bs64_swin_224.py',
+ '../_base_/schedules/imagenet_bs1024_adamw_swin.py',
+ '../_base_/default_runtime.py'
+]
+
+# schedule settings
+optim_wrapper = dict(clip_grad=dict(max_norm=5.0))
diff --git a/configs/swin_transformer/swin-tiny_16xb64_in1k.py b/configs/swin_transformer/swin-tiny_16xb64_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..9a1ce2508ab603b008640583de78c64d2f178620
--- /dev/null
+++ b/configs/swin_transformer/swin-tiny_16xb64_in1k.py
@@ -0,0 +1,9 @@
+_base_ = [
+ '../_base_/models/swin_transformer/tiny_224.py',
+ '../_base_/datasets/imagenet_bs64_swin_224.py',
+ '../_base_/schedules/imagenet_bs1024_adamw_swin.py',
+ '../_base_/default_runtime.py'
+]
+
+# schedule settings
+optim_wrapper = dict(clip_grad=dict(max_norm=5.0))
diff --git a/configs/swin_transformer_v2/README.md b/configs/swin_transformer_v2/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..dd20548ae780ebca6cf0cc982ea71c782e369b52
--- /dev/null
+++ b/configs/swin_transformer_v2/README.md
@@ -0,0 +1,121 @@
+# Swin-Transformer V2
+
+> [Swin Transformer V2: Scaling Up Capacity and Resolution](https://arxiv.org/abs/2111.09883)
+
+
+
+## Introduction
+
+**Swin Transformer V2** is a work on scaling up vision models, built on [Swin Transformer](https://github.com/open-mmlab/mmpretrain/tree/main/configs/swin_transformer). In the vision field, performance cannot be improved by simply scaling up the model as is done for NLP models. The possible reasons mentioned in the paper are:
+
+- Training instability when scaling up the model size
+- Difficulty in transferring a model trained at low resolution to tasks with higher resolution
+- Excessive GPU memory consumption
+
+To address these issues, the paper proposes the following improvements:
+
+- Post normalization: apply layer normalization after the self-attention layer and the MLP block
+- Scaled cosine attention: use cosine similarity to compute the attention between token pairs (sketched below)
+- Log-spaced continuous position bias: redefine the relative position encoding
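+
+For reference, below is a minimal PyTorch sketch of the scaled cosine attention idea. It is an illustration only, not the MMPreTrain implementation; the tensor shapes and the clamp value are assumptions.
+
+```python
+import math
+
+import torch
+import torch.nn.functional as F
+
+
+def scaled_cosine_attention(q, k, v, logit_scale):
+    """Attention from cosine similarity instead of scaled dot products.
+
+    q, k, v: (batch, num_heads, num_tokens, head_dim)
+    logit_scale: learnable per-head log-temperature, clamped for stability.
+    """
+    # Cosine similarity between every query/key pair.
+    attn = F.normalize(q, dim=-1) @ F.normalize(k, dim=-1).transpose(-2, -1)
+    # Learnable, clamped temperature (the clamp value is an assumption).
+    scale = torch.clamp(logit_scale, max=math.log(100.0)).exp()
+    attn = (attn * scale).softmax(dim=-1)
+    return attn @ v
+
+
+q = k = v = torch.rand(1, 4, 64, 32)
+logit_scale = torch.nn.Parameter(torch.log(10 * torch.ones(4, 1, 1)))
+print(scaled_cosine_attention(q, k, v, logit_scale).shape)  # (1, 4, 64, 32)
+```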
+
+
+

+
+
+## Abstract
+
+
+
+
+
+
+
+Large-scale NLP models have been shown to significantly improve the performance on language tasks with no signs of saturation. They also demonstrate amazing few-shot capabilities like that of human beings. This paper aims to explore large-scale models in computer vision. We tackle three major issues in training and application of large vision models, including training instability, resolution gaps between pre-training and fine-tuning, and hunger on labelled data. Three main techniques are proposed: 1) a residual-post-norm method combined with cosine attention to improve training stability; 2) A log-spaced continuous position bias method to effectively transfer models pre-trained using low-resolution images to downstream tasks with high-resolution inputs; 3) A self-supervised pre-training method, SimMIM, to reduce the needs of vast labeled images. Through these techniques, this paper successfully trained a 3 billion-parameter Swin Transformer V2 model, which is the largest dense vision model to date, and makes it capable of training with images of up to 1,536×1,536 resolution. It set new performance records on 4 representative vision tasks, including ImageNet-V2 image classification, COCO object detection, ADE20K semantic segmentation, and Kinetics-400 video action classification. Also note our training is much more efficient than that in Google's billion-level visual models, which consumes 40 times less labelled data and 40 times less training time.
+
+
+
+
+
+## How to use it?
+
+
+
+**Predict image**
+
+```python
+from mmpretrain import inference_model
+
+predict = inference_model('swinv2-tiny-w8_3rdparty_in1k-256px', 'demo/bird.JPEG')
+print(predict['pred_class'])
+print(predict['pred_score'])
+```
+
+**Use the model**
+
+```python
+import torch
+from mmpretrain import get_model
+
+model = get_model('swinv2-tiny-w8_3rdparty_in1k-256px', pretrained=True)
+inputs = torch.rand(1, 3, 224, 224)
+out = model(inputs)
+print(type(out))
+# To extract features.
+feats = model.extract_feat(inputs)
+print(type(feats))
+```
+
+**Test Command**
+
+Prepare your dataset according to the [docs](https://mmpretrain.readthedocs.io/en/latest/user_guides/dataset_prepare.html#prepare-dataset).
+
+Test:
+
+```shell
+python tools/test.py configs/swin_transformer_v2/swinv2-tiny-w8_16xb64_in1k-256px.py https://download.openmmlab.com/mmclassification/v0/swin-v2/swinv2-tiny-w8_3rdparty_in1k-256px_20220803-e318968f.pth
+```
+
+
+
+## Models and results
+
+### Pretrained models
+
+| Model | Params (M) | Flops (G) | Config | Download |
+| :---------------------------------------- | :--------: | :-------: | :----------------------------------------------: | :------------------------------------------------------------------------------------------: |
+| `swinv2-base-w12_3rdparty_in21k-192px`\* | 87.92 | 8.51 | [config](swinv2-base-w12_8xb128_in21k-192px.py) | [model](https://download.openmmlab.com/mmclassification/v0/swin-v2/pretrain/swinv2-base-w12_3rdparty_in21k-192px_20220803-f7dc9763.pth) |
+| `swinv2-large-w12_3rdparty_in21k-192px`\* | 196.74 | 19.04 | [config](swinv2-large-w12_8xb128_in21k-192px.py) | [model](https://download.openmmlab.com/mmclassification/v0/swin-v2/pretrain/swinv2-large-w12_3rdparty_in21k-192px_20220803-d9073fee.pth) |
+
+*Models with * are converted from the [official repo](https://github.com/microsoft/Swin-Transformer). The config files of these models are only for inference. We haven't reproduced the training results.*
+
+### Image Classification on ImageNet-1k
+
+| Model | Pretrain | Params (M) | Flops (G) | Top-1 (%) | Top-5 (%) | Config | Download |
+| :------------------------------------------------ | :----------: | :--------: | :-------: | :-------: | :-------: | :------------------------------------------------: | :--------------------------------------------------: |
+| `swinv2-tiny-w8_3rdparty_in1k-256px`\* | From scratch | 28.35 | 4.35 | 81.76 | 95.87 | [config](swinv2-tiny-w8_16xb64_in1k-256px.py) | [model](https://download.openmmlab.com/mmclassification/v0/swin-v2/swinv2-tiny-w8_3rdparty_in1k-256px_20220803-e318968f.pth) |
+| `swinv2-tiny-w16_3rdparty_in1k-256px`\* | From scratch | 28.35 | 4.40 | 82.81 | 96.23 | [config](swinv2-tiny-w16_16xb64_in1k-256px.py) | [model](https://download.openmmlab.com/mmclassification/v0/swin-v2/swinv2-tiny-w16_3rdparty_in1k-256px_20220803-9651cdd7.pth) |
+| `swinv2-small-w8_3rdparty_in1k-256px`\* | From scratch | 49.73 | 8.45 | 83.74 | 96.60 | [config](swinv2-small-w8_16xb64_in1k-256px.py) | [model](https://download.openmmlab.com/mmclassification/v0/swin-v2/swinv2-small-w8_3rdparty_in1k-256px_20220803-b01a4332.pth) |
+| `swinv2-small-w16_3rdparty_in1k-256px`\* | From scratch | 49.73 | 8.57 | 84.13 | 96.83 | [config](swinv2-small-w16_16xb64_in1k-256px.py) | [model](https://download.openmmlab.com/mmclassification/v0/swin-v2/swinv2-small-w16_3rdparty_in1k-256px_20220803-b707d206.pth) |
+| `swinv2-base-w8_3rdparty_in1k-256px`\* | From scratch | 87.92 | 14.99 | 84.20 | 96.86 | [config](swinv2-base-w8_16xb64_in1k-256px.py) | [model](https://download.openmmlab.com/mmclassification/v0/swin-v2/swinv2-base-w8_3rdparty_in1k-256px_20220803-8ff28f2b.pth) |
+| `swinv2-base-w16_3rdparty_in1k-256px`\* | From scratch | 87.92 | 15.14 | 84.60 | 97.05 | [config](swinv2-base-w16_16xb64_in1k-256px.py) | [model](https://download.openmmlab.com/mmclassification/v0/swin-v2/swinv2-base-w16_3rdparty_in1k-256px_20220803-5a1886b7.pth) |
+| `swinv2-base-w16_in21k-pre_3rdparty_in1k-256px`\* | ImageNet-21k | 87.92 | 15.14 | 86.17 | 97.88 | [config](swinv2-base-w16_in21k-pre_16xb64_in1k-256px.py) | [model](https://download.openmmlab.com/mmclassification/v0/swin-v2/swinv2-base-w16_in21k-pre_3rdparty_in1k-256px_20220803-8d7aa8ad.pth) |
+| `swinv2-base-w24_in21k-pre_3rdparty_in1k-384px`\* | ImageNet-21k | 87.92 | 34.07 | 87.14 | 98.23 | [config](swinv2-base-w24_in21k-pre_16xb64_in1k-384px.py) | [model](https://download.openmmlab.com/mmclassification/v0/swin-v2/swinv2-base-w24_in21k-pre_3rdparty_in1k-384px_20220803-44eb70f8.pth) |
+| `swinv2-large-w16_in21k-pre_3rdparty_in1k-256px`\* | ImageNet-21k | 196.75 | 33.86 | 86.93 | 98.06 | [config](swinv2-large-w16_in21k-pre_16xb64_in1k-256px.py) | [model](https://download.openmmlab.com/mmclassification/v0/swin-v2/swinv2-large-w16_in21k-pre_3rdparty_in1k-256px_20220803-c40cbed7.pth) |
+| `swinv2-large-w24_in21k-pre_3rdparty_in1k-384px`\* | ImageNet-21k | 196.75 | 76.20 | 87.59 | 98.27 | [config](swinv2-large-w24_in21k-pre_16xb64_in1k-384px.py) | [model](https://download.openmmlab.com/mmclassification/v0/swin-v2/swinv2-large-w24_in21k-pre_3rdparty_in1k-384px_20220803-3b36c165.pth) |
+
+*Models with * are converted from the [official repo](https://github.com/microsoft/Swin-Transformer). The config files of these models are only for inference. We haven't reproduced the training results.*
+
+## Citation
+
+```bibtex
+@article{https://doi.org/10.48550/arxiv.2111.09883,
+ doi = {10.48550/ARXIV.2111.09883},
+ url = {https://arxiv.org/abs/2111.09883},
+ author = {Liu, Ze and Hu, Han and Lin, Yutong and Yao, Zhuliang and Xie, Zhenda and Wei, Yixuan and Ning, Jia and Cao, Yue and Zhang, Zheng and Dong, Li and Wei, Furu and Guo, Baining},
+ keywords = {Computer Vision and Pattern Recognition (cs.CV), FOS: Computer and information sciences, FOS: Computer and information sciences},
+ title = {Swin Transformer V2: Scaling Up Capacity and Resolution},
+ publisher = {arXiv},
+ year = {2021},
+ copyright = {Creative Commons Attribution 4.0 International}
+}
+```
diff --git a/configs/swin_transformer_v2/metafile.yml b/configs/swin_transformer_v2/metafile.yml
new file mode 100644
index 0000000000000000000000000000000000000000..55a14cbab587f037d96583d3b0210ac3008b1118
--- /dev/null
+++ b/configs/swin_transformer_v2/metafile.yml
@@ -0,0 +1,206 @@
+Collections:
+ - Name: Swin-Transformer V2
+ Metadata:
+ Training Data: ImageNet-1k
+ Training Techniques:
+ - AdamW
+ - Weight Decay
+ Training Resources: 16x V100 GPUs
+ Epochs: 300
+ Batch Size: 1024
+ Architecture:
+ - Shift Window Multihead Self Attention
+ Paper:
+ URL: https://arxiv.org/abs/2111.09883
+ Title: "Swin Transformer V2: Scaling Up Capacity and Resolution"
+ README: configs/swin_transformer_v2/README.md
+
+Models:
+ - Name: swinv2-tiny-w8_3rdparty_in1k-256px
+ Metadata:
+ FLOPs: 4350000000
+ Parameters: 28350000
+ In Collection: Swin-Transformer V2
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 81.76
+ Top 5 Accuracy: 95.87
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/swin-v2/swinv2-tiny-w8_3rdparty_in1k-256px_20220803-e318968f.pth
+ Config: configs/swin_transformer_v2/swinv2-tiny-w8_16xb64_in1k-256px.py
+ Converted From:
+ Weights: https://github.com/SwinTransformer/storage/releases/download/v2.0.0/swinv2_tiny_patch4_window8_256.pth
+ Code: https://github.com/microsoft/Swin-Transformer
+ - Name: swinv2-tiny-w16_3rdparty_in1k-256px
+ Metadata:
+ FLOPs: 4400000000
+ Parameters: 28350000
+ In Collection: Swin-Transformer V2
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 82.81
+ Top 5 Accuracy: 96.23
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/swin-v2/swinv2-tiny-w16_3rdparty_in1k-256px_20220803-9651cdd7.pth
+ Config: configs/swin_transformer_v2/swinv2-tiny-w16_16xb64_in1k-256px.py
+ Converted From:
+ Weights: https://github.com/SwinTransformer/storage/releases/download/v2.0.0/swinv2_tiny_patch4_window16_256.pth
+ Code: https://github.com/microsoft/Swin-Transformer
+ - Name: swinv2-small-w8_3rdparty_in1k-256px
+ Metadata:
+ FLOPs: 8450000000
+ Parameters: 49730000
+ In Collection: Swin-Transformer V2
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 83.74
+ Top 5 Accuracy: 96.6
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/swin-v2/swinv2-small-w8_3rdparty_in1k-256px_20220803-b01a4332.pth
+ Config: configs/swin_transformer_v2/swinv2-small-w8_16xb64_in1k-256px.py
+ Converted From:
+ Weights: https://github.com/SwinTransformer/storage/releases/download/v2.0.0/swinv2_small_patch4_window8_256.pth
+ Code: https://github.com/microsoft/Swin-Transformer
+ - Name: swinv2-small-w16_3rdparty_in1k-256px
+ Metadata:
+ FLOPs: 8570000000
+ Parameters: 49730000
+ In Collection: Swin-Transformer V2
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 84.13
+ Top 5 Accuracy: 96.83
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/swin-v2/swinv2-small-w16_3rdparty_in1k-256px_20220803-b707d206.pth
+ Config: configs/swin_transformer_v2/swinv2-small-w16_16xb64_in1k-256px.py
+ Converted From:
+ Weights: https://github.com/SwinTransformer/storage/releases/download/v2.0.0/swinv2_small_patch4_window16_256.pth
+ Code: https://github.com/microsoft/Swin-Transformer
+ - Name: swinv2-base-w8_3rdparty_in1k-256px
+ Metadata:
+ FLOPs: 14990000000
+ Parameters: 87920000
+ In Collection: Swin-Transformer V2
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 84.2
+ Top 5 Accuracy: 96.86
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/swin-v2/swinv2-base-w8_3rdparty_in1k-256px_20220803-8ff28f2b.pth
+ Config: configs/swin_transformer_v2/swinv2-base-w8_16xb64_in1k-256px.py
+ Converted From:
+ Weights: https://github.com/SwinTransformer/storage/releases/download/v2.0.0/swinv2_base_patch4_window8_256.pth
+ Code: https://github.com/microsoft/Swin-Transformer
+ - Name: swinv2-base-w16_3rdparty_in1k-256px
+ Metadata:
+ FLOPs: 15140000000
+ Parameters: 87920000
+ In Collection: Swin-Transformer V2
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 84.6
+ Top 5 Accuracy: 97.05
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/swin-v2/swinv2-base-w16_3rdparty_in1k-256px_20220803-5a1886b7.pth
+ Config: configs/swin_transformer_v2/swinv2-base-w16_16xb64_in1k-256px.py
+ Converted From:
+ Weights: https://github.com/SwinTransformer/storage/releases/download/v2.0.0/swinv2_base_patch4_window16_256.pth
+ Code: https://github.com/microsoft/Swin-Transformer
+ - Name: swinv2-base-w16_in21k-pre_3rdparty_in1k-256px
+ Metadata:
+ Training Data: ImageNet-21k
+ FLOPs: 15140000000
+ Parameters: 87920000
+ In Collection: Swin-Transformer V2
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 86.17
+ Top 5 Accuracy: 97.88
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/swin-v2/swinv2-base-w16_in21k-pre_3rdparty_in1k-256px_20220803-8d7aa8ad.pth
+ Config: configs/swin_transformer_v2/swinv2-base-w16_in21k-pre_16xb64_in1k-256px.py
+ Converted From:
+ Weights: https://github.com/SwinTransformer/storage/releases/download/v2.0.0/swinv2_base_patch4_window12to16_192to256_22kto1k_ft.pth
+ Code: https://github.com/microsoft/Swin-Transformer
+ - Name: swinv2-base-w24_in21k-pre_3rdparty_in1k-384px
+ Metadata:
+ Training Data: ImageNet-21k
+ FLOPs: 34070000000
+ Parameters: 87920000
+ In Collection: Swin-Transformer V2
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 87.14
+ Top 5 Accuracy: 98.23
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/swin-v2/swinv2-base-w24_in21k-pre_3rdparty_in1k-384px_20220803-44eb70f8.pth
+ Config: configs/swin_transformer_v2/swinv2-base-w24_in21k-pre_16xb64_in1k-384px.py
+ Converted From:
+ Weights: https://github.com/SwinTransformer/storage/releases/download/v2.0.0/swinv2_base_patch4_window12to24_192to384_22kto1k_ft.pth
+ Code: https://github.com/microsoft/Swin-Transformer
+ - Name: swinv2-large-w16_in21k-pre_3rdparty_in1k-256px
+ Metadata:
+ Training Data: ImageNet-21k
+ FLOPs: 33860000000
+ Parameters: 196750000
+ In Collection: Swin-Transformer V2
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 86.93
+ Top 5 Accuracy: 98.06
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/swin-v2/swinv2-large-w16_in21k-pre_3rdparty_in1k-256px_20220803-c40cbed7.pth
+ Config: configs/swin_transformer_v2/swinv2-large-w16_in21k-pre_16xb64_in1k-256px.py
+ Converted From:
+ Weights: https://github.com/SwinTransformer/storage/releases/download/v2.0.0/swinv2_large_patch4_window12to16_192to256_22kto1k_ft.pth
+ Code: https://github.com/microsoft/Swin-Transformer
+ - Name: swinv2-large-w24_in21k-pre_3rdparty_in1k-384px
+ Metadata:
+ Training Data: ImageNet-21k
+ FLOPs: 76200000000
+ Parameters: 196750000
+ In Collection: Swin-Transformer V2
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 87.59
+ Top 5 Accuracy: 98.27
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/swin-v2/swinv2-large-w24_in21k-pre_3rdparty_in1k-384px_20220803-3b36c165.pth
+ Config: configs/swin_transformer_v2/swinv2-large-w24_in21k-pre_16xb64_in1k-384px.py
+ Converted From:
+ Weights: https://github.com/SwinTransformer/storage/releases/download/v2.0.0/swinv2_large_patch4_window12to24_192to384_22kto1k_ft.pth
+ Code: https://github.com/microsoft/Swin-Transformer
+ - Name: swinv2-base-w12_3rdparty_in21k-192px
+ Metadata:
+ Training Data: ImageNet-21k
+ FLOPs: 8510000000
+ Parameters: 87920000
+ In Collection: Swin-Transformer V2
+ Results: null
+ Weights: https://download.openmmlab.com/mmclassification/v0/swin-v2/pretrain/swinv2-base-w12_3rdparty_in21k-192px_20220803-f7dc9763.pth
+ Config: configs/swin_transformer_v2/swinv2-base-w12_8xb128_in21k-192px.py
+ Converted From:
+ Weights: https://github.com/SwinTransformer/storage/releases/download/v2.0.0/swinv2_base_patch4_window12_192_22k.pth
+ Code: https://github.com/microsoft/Swin-Transformer
+ - Name: swinv2-large-w12_3rdparty_in21k-192px
+ Metadata:
+ Training Data: ImageNet-21k
+ FLOPs: 19040000000
+ Parameters: 196740000
+ In Collection: Swin-Transformer V2
+ Results: null
+ Weights: https://download.openmmlab.com/mmclassification/v0/swin-v2/pretrain/swinv2-large-w12_3rdparty_in21k-192px_20220803-d9073fee.pth
+ Config: configs/swin_transformer_v2/swinv2-large-w12_8xb128_in21k-192px.py
+ Converted From:
+ Weights: https://github.com/SwinTransformer/storage/releases/download/v2.0.0/swinv2_large_patch4_window12_192_22k.pth
+ Code: https://github.com/microsoft/Swin-Transformer
diff --git a/configs/swin_transformer_v2/swinv2-base-w12_8xb128_in21k-192px.py b/configs/swin_transformer_v2/swinv2-base-w12_8xb128_in21k-192px.py
new file mode 100644
index 0000000000000000000000000000000000000000..9b01b75d296dae9db97d2d85f73463f6c87c0b1c
--- /dev/null
+++ b/configs/swin_transformer_v2/swinv2-base-w12_8xb128_in21k-192px.py
@@ -0,0 +1,19 @@
+_base_ = [
+ '../_base_/models/swin_transformer_v2/base_256.py',
+ '../_base_/datasets/imagenet21k_bs128.py',
+ '../_base_/schedules/imagenet_bs1024_adamw_swin.py',
+ '../_base_/default_runtime.py',
+]
+
+# model settings
+model = dict(
+ backbone=dict(img_size=192, window_size=[12, 12, 12, 6]),
+ head=dict(num_classes=21841),
+)
+
+# dataset settings
+data_preprocessor = dict(num_classes=21841)
+
+_base_['train_pipeline'][1]['scale'] = 192 # RandomResizedCrop
+_base_['test_pipeline'][1]['scale'] = 219 # ResizeEdge
+_base_['test_pipeline'][2]['crop_size'] = 192 # CenterCrop
diff --git a/configs/swin_transformer_v2/swinv2-base-w16_16xb64_in1k-256px.py b/configs/swin_transformer_v2/swinv2-base-w16_16xb64_in1k-256px.py
new file mode 100644
index 0000000000000000000000000000000000000000..5f375ee1fc9b10885f8b9d9f4794b8530c1460b5
--- /dev/null
+++ b/configs/swin_transformer_v2/swinv2-base-w16_16xb64_in1k-256px.py
@@ -0,0 +1,8 @@
+_base_ = [
+ '../_base_/models/swin_transformer_v2/base_256.py',
+ '../_base_/datasets/imagenet_bs64_swin_256.py',
+ '../_base_/schedules/imagenet_bs1024_adamw_swin.py',
+ '../_base_/default_runtime.py'
+]
+
+model = dict(backbone=dict(window_size=[16, 16, 16, 8]))
diff --git a/configs/swin_transformer_v2/swinv2-base-w16_in21k-pre_16xb64_in1k-256px.py b/configs/swin_transformer_v2/swinv2-base-w16_in21k-pre_16xb64_in1k-256px.py
new file mode 100644
index 0000000000000000000000000000000000000000..0725f9e739a099551a4d5b5f007bcb83708be309
--- /dev/null
+++ b/configs/swin_transformer_v2/swinv2-base-w16_in21k-pre_16xb64_in1k-256px.py
@@ -0,0 +1,13 @@
+_base_ = [
+ '../_base_/models/swin_transformer_v2/base_256.py',
+ '../_base_/datasets/imagenet_bs64_swin_256.py',
+ '../_base_/schedules/imagenet_bs1024_adamw_swin.py',
+ '../_base_/default_runtime.py'
+]
+
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ window_size=[16, 16, 16, 8],
+ drop_path_rate=0.2,
+ pretrained_window_sizes=[12, 12, 12, 6]))
diff --git a/configs/swin_transformer_v2/swinv2-base-w24_in21k-pre_16xb64_in1k-384px.py b/configs/swin_transformer_v2/swinv2-base-w24_in21k-pre_16xb64_in1k-384px.py
new file mode 100644
index 0000000000000000000000000000000000000000..3dd4e5fd935a356d29e7790e91d4538c94711062
--- /dev/null
+++ b/configs/swin_transformer_v2/swinv2-base-w24_in21k-pre_16xb64_in1k-384px.py
@@ -0,0 +1,14 @@
+_base_ = [
+ '../_base_/models/swin_transformer_v2/base_384.py',
+ '../_base_/datasets/imagenet_bs64_swin_384.py',
+ '../_base_/schedules/imagenet_bs1024_adamw_swin.py',
+ '../_base_/default_runtime.py'
+]
+
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ img_size=384,
+ window_size=[24, 24, 24, 12],
+ drop_path_rate=0.2,
+ pretrained_window_sizes=[12, 12, 12, 6]))
diff --git a/configs/swin_transformer_v2/swinv2-base-w8_16xb64_in1k-256px.py b/configs/swin_transformer_v2/swinv2-base-w8_16xb64_in1k-256px.py
new file mode 100644
index 0000000000000000000000000000000000000000..23fc40701470f8e41252c274072896d1cd811f28
--- /dev/null
+++ b/configs/swin_transformer_v2/swinv2-base-w8_16xb64_in1k-256px.py
@@ -0,0 +1,6 @@
+_base_ = [
+ '../_base_/models/swin_transformer_v2/base_256.py',
+ '../_base_/datasets/imagenet_bs64_swin_256.py',
+ '../_base_/schedules/imagenet_bs1024_adamw_swin.py',
+ '../_base_/default_runtime.py'
+]
diff --git a/configs/swin_transformer_v2/swinv2-large-w12_8xb128_in21k-192px.py b/configs/swin_transformer_v2/swinv2-large-w12_8xb128_in21k-192px.py
new file mode 100644
index 0000000000000000000000000000000000000000..9b01b75d296dae9db97d2d85f73463f6c87c0b1c
--- /dev/null
+++ b/configs/swin_transformer_v2/swinv2-large-w12_8xb128_in21k-192px.py
@@ -0,0 +1,19 @@
+_base_ = [
+    '../_base_/models/swin_transformer_v2/large_256.py',
+ '../_base_/datasets/imagenet21k_bs128.py',
+ '../_base_/schedules/imagenet_bs1024_adamw_swin.py',
+ '../_base_/default_runtime.py',
+]
+
+# model settings
+model = dict(
+ backbone=dict(img_size=192, window_size=[12, 12, 12, 6]),
+ head=dict(num_classes=21841),
+)
+
+# dataset settings
+data_preprocessor = dict(num_classes=21841)
+
+_base_['train_pipeline'][1]['scale'] = 192 # RandomResizedCrop
+_base_['test_pipeline'][1]['scale'] = 219 # ResizeEdge
+_base_['test_pipeline'][2]['crop_size'] = 192 # CenterCrop
diff --git a/configs/swin_transformer_v2/swinv2-large-w16_in21k-pre_16xb64_in1k-256px.py b/configs/swin_transformer_v2/swinv2-large-w16_in21k-pre_16xb64_in1k-256px.py
new file mode 100644
index 0000000000000000000000000000000000000000..62a2a29b843f197c15d8f53a7cbd1029be675fa8
--- /dev/null
+++ b/configs/swin_transformer_v2/swinv2-large-w16_in21k-pre_16xb64_in1k-256px.py
@@ -0,0 +1,13 @@
+# Only for evaluation
+_base_ = [
+ '../_base_/models/swin_transformer_v2/large_256.py',
+ '../_base_/datasets/imagenet_bs64_swin_256.py',
+ '../_base_/schedules/imagenet_bs1024_adamw_swin.py',
+ '../_base_/default_runtime.py'
+]
+
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ window_size=[16, 16, 16, 8], pretrained_window_sizes=[12, 12, 12, 6]),
+)
diff --git a/configs/swin_transformer_v2/swinv2-large-w24_in21k-pre_16xb64_in1k-384px.py b/configs/swin_transformer_v2/swinv2-large-w24_in21k-pre_16xb64_in1k-384px.py
new file mode 100644
index 0000000000000000000000000000000000000000..d97d9b2b869c1e0c264910859b6f980387a7b6ab
--- /dev/null
+++ b/configs/swin_transformer_v2/swinv2-large-w24_in21k-pre_16xb64_in1k-384px.py
@@ -0,0 +1,15 @@
+# Only for evaluation
+_base_ = [
+ '../_base_/models/swin_transformer_v2/large_384.py',
+ '../_base_/datasets/imagenet_bs64_swin_384.py',
+ '../_base_/schedules/imagenet_bs1024_adamw_swin.py',
+ '../_base_/default_runtime.py'
+]
+
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ img_size=384,
+ window_size=[24, 24, 24, 12],
+ pretrained_window_sizes=[12, 12, 12, 6]),
+)
diff --git a/configs/swin_transformer_v2/swinv2-small-w16_16xb64_in1k-256px.py b/configs/swin_transformer_v2/swinv2-small-w16_16xb64_in1k-256px.py
new file mode 100644
index 0000000000000000000000000000000000000000..f87265dd199c712a6442407db852b5d4b6aabd7d
--- /dev/null
+++ b/configs/swin_transformer_v2/swinv2-small-w16_16xb64_in1k-256px.py
@@ -0,0 +1,8 @@
+_base_ = [
+ '../_base_/models/swin_transformer_v2/small_256.py',
+ '../_base_/datasets/imagenet_bs64_swin_256.py',
+ '../_base_/schedules/imagenet_bs1024_adamw_swin.py',
+ '../_base_/default_runtime.py'
+]
+
+model = dict(backbone=dict(window_size=[16, 16, 16, 8]))
diff --git a/configs/swin_transformer_v2/swinv2-small-w8_16xb64_in1k-256px.py b/configs/swin_transformer_v2/swinv2-small-w8_16xb64_in1k-256px.py
new file mode 100644
index 0000000000000000000000000000000000000000..f1001f1b6e1978c3706ca6183f863c316b13ade4
--- /dev/null
+++ b/configs/swin_transformer_v2/swinv2-small-w8_16xb64_in1k-256px.py
@@ -0,0 +1,6 @@
+_base_ = [
+ '../_base_/models/swin_transformer_v2/small_256.py',
+ '../_base_/datasets/imagenet_bs64_swin_256.py',
+ '../_base_/schedules/imagenet_bs1024_adamw_swin.py',
+ '../_base_/default_runtime.py'
+]
diff --git a/configs/swin_transformer_v2/swinv2-tiny-w16_16xb64_in1k-256px.py b/configs/swin_transformer_v2/swinv2-tiny-w16_16xb64_in1k-256px.py
new file mode 100644
index 0000000000000000000000000000000000000000..7e1f290f371e1b9084f4cd5291e1e638d0ad54e3
--- /dev/null
+++ b/configs/swin_transformer_v2/swinv2-tiny-w16_16xb64_in1k-256px.py
@@ -0,0 +1,8 @@
+_base_ = [
+ '../_base_/models/swin_transformer_v2/tiny_256.py',
+ '../_base_/datasets/imagenet_bs64_swin_256.py',
+ '../_base_/schedules/imagenet_bs1024_adamw_swin.py',
+ '../_base_/default_runtime.py'
+]
+
+model = dict(backbone=dict(window_size=[16, 16, 16, 8]))
diff --git a/configs/swin_transformer_v2/swinv2-tiny-w8_16xb64_in1k-256px.py b/configs/swin_transformer_v2/swinv2-tiny-w8_16xb64_in1k-256px.py
new file mode 100644
index 0000000000000000000000000000000000000000..2cdc9a25ae8a64758f8642c079e1ff7fbf0548c3
--- /dev/null
+++ b/configs/swin_transformer_v2/swinv2-tiny-w8_16xb64_in1k-256px.py
@@ -0,0 +1,6 @@
+_base_ = [
+ '../_base_/models/swin_transformer_v2/tiny_256.py',
+ '../_base_/datasets/imagenet_bs64_swin_256.py',
+ '../_base_/schedules/imagenet_bs1024_adamw_swin.py',
+ '../_base_/default_runtime.py'
+]
diff --git a/configs/t2t_vit/README.md b/configs/t2t_vit/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..bf0967cf27f606788174bc9fc2198cad3dbfced6
--- /dev/null
+++ b/configs/t2t_vit/README.md
@@ -0,0 +1,81 @@
+# Tokens-to-Token ViT
+
+> [Tokens-to-Token ViT: Training Vision Transformers from Scratch on ImageNet](https://arxiv.org/abs/2101.11986)
+
+
+
+## Abstract
+
+Transformers, which are popular for language modeling, have been explored for solving vision tasks recently, e.g., the Vision Transformer (ViT) for image classification. The ViT model splits each image into a sequence of tokens with fixed length and then applies multiple Transformer layers to model their global relation for classification. However, ViT achieves inferior performance to CNNs when trained from scratch on a midsize dataset like ImageNet. We find it is because: 1) the simple tokenization of input images fails to model the important local structure such as edges and lines among neighboring pixels, leading to low training sample efficiency; 2) the redundant attention backbone design of ViT leads to limited feature richness for fixed computation budgets and limited training samples. To overcome such limitations, we propose a new Tokens-To-Token Vision Transformer (T2T-ViT), which incorporates 1) a layer-wise Tokens-to-Token (T2T) transformation to progressively structurize the image to tokens by recursively aggregating neighboring Tokens into one Token (Tokens-to-Token), such that local structure represented by surrounding tokens can be modeled and tokens length can be reduced; 2) an efficient backbone with a deep-narrow structure for vision transformer motivated by CNN architecture design after empirical study. Notably, T2T-ViT reduces the parameter count and MACs of vanilla ViT by half, while achieving more than 3.0% improvement when trained from scratch on ImageNet. It also outperforms ResNets and achieves comparable performance with MobileNets by directly training on ImageNet. For example, T2T-ViT with comparable size to ResNet50 (21.5M parameters) can achieve 83.3% top1 accuracy in image resolution 384×384 on ImageNet.
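+
+A minimal sketch of one Tokens-to-Token "soft split" step is shown below, assuming an `nn.Unfold`-based aggregation; the kernel, stride and padding values are illustrative, not the exact settings of the MMPreTrain backbone.
+
+```python
+import torch
+import torch.nn as nn
+
+
+def soft_split(tokens, h, w, kernel=3, stride=2, padding=1):
+    """Aggregate neighboring tokens into fewer, longer tokens.
+
+    tokens: (batch, num_tokens, channels) with num_tokens == h * w.
+    Returns (batch, new_num_tokens, channels * kernel * kernel).
+    """
+    b, n, c = tokens.shape
+    assert n == h * w
+    # Re-assemble tokens into an image-like feature map ...
+    feat = tokens.transpose(1, 2).reshape(b, c, h, w)
+    # ... and unfold it into overlapping patches, so that each new token
+    # aggregates a local neighborhood of the previous tokens.
+    patches = nn.Unfold(kernel_size=kernel, stride=stride, padding=padding)(feat)
+    return patches.transpose(1, 2)
+
+
+tokens = torch.rand(1, 56 * 56, 64)
+print(soft_split(tokens, 56, 56).shape)  # torch.Size([1, 784, 576])
+```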
+
+
+

+
+
+## How to use it?
+
+
+
+**Predict image**
+
+```python
+from mmpretrain import inference_model
+
+predict = inference_model('t2t-vit-t-14_8xb64_in1k', 'demo/bird.JPEG')
+print(predict['pred_class'])
+print(predict['pred_score'])
+```
+
+**Use the model**
+
+```python
+import torch
+from mmpretrain import get_model
+
+model = get_model('t2t-vit-t-14_8xb64_in1k', pretrained=True)
+inputs = torch.rand(1, 3, 224, 224)
+out = model(inputs)
+print(type(out))
+# To extract features.
+feats = model.extract_feat(inputs)
+print(type(feats))
+```
+
+**Train/Test Command**
+
+Prepare your dataset according to the [docs](https://mmpretrain.readthedocs.io/en/latest/user_guides/dataset_prepare.html#prepare-dataset).
+
+Train:
+
+```shell
+python tools/train.py configs/t2t_vit/t2t-vit-t-14_8xb64_in1k.py
+```
+
+Test:
+
+```shell
+python tools/test.py configs/t2t_vit/t2t-vit-t-14_8xb64_in1k.py https://download.openmmlab.com/mmclassification/v0/t2t-vit/t2t-vit-t-14_8xb64_in1k_20211220-f7378dd5.pth
+```
+
+
+
+## Models and results
+
+### Image Classification on ImageNet-1k
+
+| Model | Pretrain | Params (M) | Flops (G) | Top-1 (%) | Top-5 (%) | Config | Download |
+| :------------------------ | :----------: | :--------: | :-------: | :-------: | :-------: | :----------------------------------: | :----------------------------------------------------------------------------------------: |
+| `t2t-vit-t-14_8xb64_in1k` | From scratch | 21.47 | 4.34 | 81.83 | 95.84 | [config](t2t-vit-t-14_8xb64_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/t2t-vit/t2t-vit-t-14_8xb64_in1k_20211220-f7378dd5.pth) \| [log](https://download.openmmlab.com/mmclassification/v0/t2t-vit/t2t-vit-t-14_8xb64_in1k_20211220-f7378dd5.json) |
+| `t2t-vit-t-19_8xb64_in1k` | From scratch | 39.08 | 7.80 | 82.63 | 96.18 | [config](t2t-vit-t-19_8xb64_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/t2t-vit/t2t-vit-t-19_8xb64_in1k_20211214-7f5e3aaf.pth) \| [log](https://download.openmmlab.com/mmclassification/v0/t2t-vit/t2t-vit-t-19_8xb64_in1k_20211214-7f5e3aaf.json) |
+| `t2t-vit-t-24_8xb64_in1k` | From scratch | 64.00 | 12.69 | 82.71 | 96.09 | [config](t2t-vit-t-24_8xb64_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/t2t-vit/t2t-vit-t-24_8xb64_in1k_20211214-b2a68ae3.pth) \| [log](https://download.openmmlab.com/mmclassification/v0/t2t-vit/t2t-vit-t-24_8xb64_in1k_20211214-b2a68ae3.json) |
+
+## Citation
+
+```bibtex
+@article{yuan2021tokens,
+ title={Tokens-to-token vit: Training vision transformers from scratch on imagenet},
+ author={Yuan, Li and Chen, Yunpeng and Wang, Tao and Yu, Weihao and Shi, Yujun and Tay, Francis EH and Feng, Jiashi and Yan, Shuicheng},
+ journal={arXiv preprint arXiv:2101.11986},
+ year={2021}
+}
+```
diff --git a/configs/t2t_vit/metafile.yml b/configs/t2t_vit/metafile.yml
new file mode 100644
index 0000000000000000000000000000000000000000..72cb2dfc92899779846af6263a125d028d17d1b2
--- /dev/null
+++ b/configs/t2t_vit/metafile.yml
@@ -0,0 +1,58 @@
+Collections:
+ - Name: Tokens-to-Token ViT
+ Metadata:
+ Training Data: ImageNet-1k
+ Architecture:
+ - Layer Normalization
+ - Scaled Dot-Product Attention
+ - Attention Dropout
+ - Dropout
+ - Tokens to Token
+ Paper:
+ URL: https://arxiv.org/abs/2101.11986
+ Title: "Tokens-to-Token ViT: Training Vision Transformers from Scratch on ImageNet"
+ README: configs/t2t_vit/README.md
+ Code:
+ URL: https://github.com/open-mmlab/mmpretrain/blob/v0.17.0/mmcls/models/backbones/t2t_vit.py
+ Version: v0.17.0
+
+Models:
+ - Name: t2t-vit-t-14_8xb64_in1k
+ Metadata:
+ FLOPs: 4340000000
+ Parameters: 21470000
+ In Collection: Tokens-to-Token ViT
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 81.83
+ Top 5 Accuracy: 95.84
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/t2t-vit/t2t-vit-t-14_8xb64_in1k_20211220-f7378dd5.pth
+ Config: configs/t2t_vit/t2t-vit-t-14_8xb64_in1k.py
+ - Name: t2t-vit-t-19_8xb64_in1k
+ Metadata:
+ FLOPs: 7800000000
+ Parameters: 39080000
+ In Collection: Tokens-to-Token ViT
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 82.63
+ Top 5 Accuracy: 96.18
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/t2t-vit/t2t-vit-t-19_8xb64_in1k_20211214-7f5e3aaf.pth
+ Config: configs/t2t_vit/t2t-vit-t-19_8xb64_in1k.py
+ - Name: t2t-vit-t-24_8xb64_in1k
+ Metadata:
+ FLOPs: 12690000000
+ Parameters: 64000000
+ In Collection: Tokens-to-Token ViT
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 82.71
+ Top 5 Accuracy: 96.09
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/t2t-vit/t2t-vit-t-24_8xb64_in1k_20211214-b2a68ae3.pth
+ Config: configs/t2t_vit/t2t-vit-t-24_8xb64_in1k.py
diff --git a/configs/t2t_vit/t2t-vit-t-14_8xb64_in1k.py b/configs/t2t_vit/t2t-vit-t-14_8xb64_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..8ff6444548c4be59f52bc2aa259e7aaac32dea3d
--- /dev/null
+++ b/configs/t2t_vit/t2t-vit-t-14_8xb64_in1k.py
@@ -0,0 +1,49 @@
+_base_ = [
+ '../_base_/models/t2t-vit-t-14.py',
+ '../_base_/datasets/imagenet_bs64_t2t_224.py',
+ '../_base_/default_runtime.py',
+]
+
+# schedule settings
+optim_wrapper = dict(
+ optimizer=dict(type='AdamW', lr=5e-4, weight_decay=0.05),
+ paramwise_cfg=dict(
+ norm_decay_mult=0.0,
+ bias_decay_mult=0.0,
+ custom_keys={'cls_token': dict(decay_mult=0.0)},
+ ),
+)
+
+param_scheduler = [
+ # warm up learning rate scheduler
+ dict(
+ type='LinearLR',
+ start_factor=1e-6,
+ by_epoch=True,
+ begin=0,
+ end=10,
+ # update by iter
+ convert_to_iter_based=True),
+ # main learning rate scheduler
+ dict(
+ type='CosineAnnealingLR',
+ T_max=290,
+ eta_min=1e-5,
+ by_epoch=True,
+ begin=10,
+ end=300),
+ # cool down learning rate scheduler
+ dict(type='ConstantLR', factor=0.1, by_epoch=True, begin=300, end=310),
+]
+
+train_cfg = dict(by_epoch=True, max_epochs=310, val_interval=1)
+val_cfg = dict()
+test_cfg = dict()
+
+# runtime settings
+custom_hooks = [dict(type='EMAHook', momentum=4e-5, priority='ABOVE_NORMAL')]
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR
+# based on the actual training batch size.
+# base_batch_size = (8 GPUs) x (64 samples per GPU)
+auto_scale_lr = dict(base_batch_size=512)
diff --git a/configs/t2t_vit/t2t-vit-t-19_8xb64_in1k.py b/configs/t2t_vit/t2t-vit-t-19_8xb64_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..0c7275372f904a4d53453b37bb50bfd31edb842f
--- /dev/null
+++ b/configs/t2t_vit/t2t-vit-t-19_8xb64_in1k.py
@@ -0,0 +1,49 @@
+_base_ = [
+ '../_base_/models/t2t-vit-t-19.py',
+ '../_base_/datasets/imagenet_bs64_t2t_224.py',
+ '../_base_/default_runtime.py',
+]
+
+# schedule settings
+optim_wrapper = dict(
+ optimizer=dict(type='AdamW', lr=5e-4, weight_decay=0.065),
+ paramwise_cfg=dict(
+ norm_decay_mult=0.0,
+ bias_decay_mult=0.0,
+ custom_keys={'cls_token': dict(decay_mult=0.0)},
+ ),
+)
+
+param_scheduler = [
+ # warm up learning rate scheduler
+ dict(
+ type='LinearLR',
+ start_factor=1e-6,
+ by_epoch=True,
+ begin=0,
+ end=10,
+ # update by iter
+ convert_to_iter_based=True),
+ # main learning rate scheduler
+ dict(
+ type='CosineAnnealingLR',
+ T_max=290,
+ eta_min=1e-5,
+ by_epoch=True,
+ begin=10,
+ end=300),
+ # cool down learning rate scheduler
+ dict(type='ConstantLR', factor=0.1, by_epoch=True, begin=300, end=310),
+]
+
+train_cfg = dict(by_epoch=True, max_epochs=310, val_interval=1)
+val_cfg = dict()
+test_cfg = dict()
+
+# runtime settings
+custom_hooks = [dict(type='EMAHook', momentum=4e-5, priority='ABOVE_NORMAL')]
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR
+# based on the actual training batch size.
+# base_batch_size = (8 GPUs) x (64 samples per GPU)
+auto_scale_lr = dict(base_batch_size=512)
diff --git a/configs/t2t_vit/t2t-vit-t-24_8xb64_in1k.py b/configs/t2t_vit/t2t-vit-t-24_8xb64_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..e180ff344bd88808e635f3004704c6079a03465b
--- /dev/null
+++ b/configs/t2t_vit/t2t-vit-t-24_8xb64_in1k.py
@@ -0,0 +1,49 @@
+_base_ = [
+ '../_base_/models/t2t-vit-t-24.py',
+ '../_base_/datasets/imagenet_bs64_t2t_224.py',
+ '../_base_/default_runtime.py',
+]
+
+# schedule settings
+optim_wrapper = dict(
+ optimizer=dict(type='AdamW', lr=5e-4, weight_decay=0.065),
+ paramwise_cfg=dict(
+ norm_decay_mult=0.0,
+ bias_decay_mult=0.0,
+ custom_keys={'cls_token': dict(decay_mult=0.0)},
+ ),
+)
+
+param_scheduler = [
+ # warm up learning rate scheduler
+ dict(
+ type='LinearLR',
+ start_factor=1e-6,
+ by_epoch=True,
+ begin=0,
+ end=10,
+ # update by iter
+ convert_to_iter_based=True),
+ # main learning rate scheduler
+ dict(
+ type='CosineAnnealingLR',
+ T_max=290,
+ eta_min=1e-5,
+ by_epoch=True,
+ begin=10,
+ end=300),
+ # cool down learning rate scheduler
+ dict(type='ConstantLR', factor=0.1, by_epoch=True, begin=300, end=310),
+]
+
+train_cfg = dict(by_epoch=True, max_epochs=310, val_interval=1)
+val_cfg = dict()
+test_cfg = dict()
+
+# runtime settings
+custom_hooks = [dict(type='EMAHook', momentum=4e-5, priority='ABOVE_NORMAL')]
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR
+# based on the actual training batch size.
+# base_batch_size = (8 GPUs) x (64 samples per GPU)
+auto_scale_lr = dict(base_batch_size=512)
diff --git a/configs/tinyvit/README.md b/configs/tinyvit/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..58ceb5779b474a9818843cec0d34e8fc8f178f4b
--- /dev/null
+++ b/configs/tinyvit/README.md
@@ -0,0 +1,82 @@
+# TinyViT
+
+> [TinyViT: Fast Pretraining Distillation for Small Vision Transformers](https://arxiv.org/abs/2207.10666)
+
+
+
+## Abstract
+
+Vision transformer (ViT) recently has drawn great attention in computer vision due to its remarkable model capability. However, most prevailing ViT models suffer from huge number of parameters, restricting their applicability on devices with limited resources. To alleviate this issue, we propose TinyViT, a new family of tiny and efficient small vision transformers pretrained on large-scale datasets with our proposed fast distillation framework. The central idea is to transfer knowledge from large pretrained models to small ones, while enabling small models to get the dividends of massive pretraining data. More specifically, we apply distillation during pretraining for knowledge transfer. The logits of large teacher models are sparsified and stored in disk in advance to save the memory cost and computation overheads. The tiny student transformers are automatically scaled down from a large pretrained model with computation and parameter constraints. Comprehensive experiments demonstrate the efficacy of TinyViT. It achieves a top-1 accuracy of 84.8% on ImageNet-1k with only 21M parameters, being comparable to SwinB pretrained on ImageNet-21k while using 4.2 times fewer parameters. Moreover, increasing image resolutions, TinyViT can reach 86.5% accuracy, being slightly better than Swin-L while using only 11% parameters. Last but not the least, we demonstrate a good transfer ability of TinyViT on various downstream tasks.
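+
+The pre-computed sparse teacher logits can be pictured with a small sketch like the one below. This is only an illustration of the idea, assuming top-k sparsification; the value of `k`, the function names and the storage format are not taken from the official pipeline.
+
+```python
+import torch
+
+
+def sparsify_teacher_logits(logits, k=10):
+    """Keep only the top-k teacher logits per image so that they can be
+    cached on disk once and replayed during student pretraining."""
+    values, indices = logits.topk(k, dim=-1)
+    return values, indices
+
+
+def densify(values, indices, num_classes, fill=-1e4):
+    """Rebuild full-size logits (for a distillation loss) from the cache."""
+    dense = torch.full((values.size(0), num_classes), fill)
+    return dense.scatter(1, indices, values)
+
+
+teacher_logits = torch.randn(4, 1000)                 # fake teacher output
+vals, idx = sparsify_teacher_logits(teacher_logits)   # store these two tensors
+recovered = densify(vals, idx, num_classes=1000)      # load them at training time
+print(vals.shape, idx.shape, recovered.shape)
+```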
+
+
+

+
+
+## How to use it?
+
+
+
+**Predict image**
+
+```python
+from mmpretrain import inference_model
+
+predict = inference_model('tinyvit-5m_3rdparty_in1k', 'demo/bird.JPEG')
+print(predict['pred_class'])
+print(predict['pred_score'])
+```
+
+**Use the model**
+
+```python
+import torch
+from mmpretrain import get_model
+
+model = get_model('tinyvit-5m_3rdparty_in1k', pretrained=True)
+inputs = torch.rand(1, 3, 224, 224)
+out = model(inputs)
+print(type(out))
+# To extract features.
+feats = model.extract_feat(inputs)
+print(type(feats))
+```
+
+**Test Command**
+
+Prepare your dataset according to the [docs](https://mmpretrain.readthedocs.io/en/latest/user_guides/dataset_prepare.html#prepare-dataset).
+
+Test:
+
+```shell
+python tools/test.py configs/tinyvit/tinyvit-5m_8xb256_in1k.py https://download.openmmlab.com/mmclassification/v0/tinyvit/tinyvit-5m_3rdparty_in1k_20221021-62cb5abf.pth
+```
+
+
+
+## Models and results
+
+### Image Classification on ImageNet-1k
+
+| Model | Pretrain | Params (M) | Flops (G) | Top-1 (%) | Top-5 (%) | Config | Download |
+| :--------------------------------------------- | :------------------: | :--------: | :-------: | :-------: | :-------: | :---------------------------------------------: | :------------------------------------------------: |
+| `tinyvit-5m_3rdparty_in1k`\* | From scratch | 5.39 | 1.29 | 79.02 | 94.74 | [config](tinyvit-5m_8xb256_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/tinyvit/tinyvit-5m_3rdparty_in1k_20221021-62cb5abf.pth) |
+| `tinyvit-5m_in21k-distill-pre_3rdparty_in1k`\* | ImageNet-21k DISTILL | 5.39 | 1.29 | 80.71 | 95.57 | [config](tinyvit-5m-distill_8xb256_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/tinyvit/tinyvit-5m_in21k-distill-pre_3rdparty_in1k_20221021-d4b010a8.pth) |
+| `tinyvit-11m_3rdparty_in1k`\* | From scratch | 11.00 | 2.05 | 81.44 | 95.79 | [config](tinyvit-11m_8xb256_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/tinyvit/tinyvit-11m_3rdparty_in1k_20221021-11ccef16.pth) |
+| `tinyvit-11m_in21k-distill-pre_3rdparty_in1k`\* | ImageNet-21k DISTILL | 11.00 | 2.05 | 83.19 | 96.53 | [config](tinyvit-11m-distill_8xb256_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/tinyvit/tinyvit-11m_in21k-distill-pre_3rdparty_in1k_20221021-5d3bc0dc.pth) |
+| `tinyvit-21m_3rdparty_in1k`\* | From scratch | 21.20 | 4.30 | 83.08 | 96.58 | [config](tinyvit-21m_8xb256_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/tinyvit/tinyvit-21m_3rdparty_in1k_20221021-5346ba34.pth) |
+| `tinyvit-21m_in21k-distill-pre_3rdparty_in1k`\* | ImageNet-21k DISTILL | 21.20 | 4.30 | 84.85 | 97.27 | [config](tinyvit-21m-distill_8xb256_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/tinyvit/tinyvit-21m_in21k-distill-pre_3rdparty_in1k_20221021-3d9b30a2.pth) |
+| `tinyvit-21m_in21k-distill-pre_3rdparty_in1k-384px`\* | ImageNet-21k DISTILL | 21.23 | 13.85 | 86.21 | 97.77 | [config](tinyvit-21m-distill_8xb256_in1k-384px.py) | [model](https://download.openmmlab.com/mmclassification/v0/tinyvit/tinyvit-21m_in21k-distill-pre_3rdparty_in1k-384px_20221021-65be6b3f.pth) |
+| `tinyvit-21m_in21k-distill-pre_3rdparty_in1k-512px`\* | ImageNet-21k DISTILL | 21.27 | 27.15 | 86.44 | 97.89 | [config](tinyvit-21m-distill_8xb256_in1k-512px.py) | [model](https://download.openmmlab.com/mmclassification/v0/tinyvit/tinyvit-21m_in21k-distill-pre_3rdparty_in1k-512px_20221021-e42a9bea.pth) |
+
+*Models with * are converted from the [official repo](https://github.com/microsoft/Cream/tree/main/TinyViT). The config files of these models are only for inference. We haven't reproduced the training results.*
+
+## Citation
+
+```bibtex
+@InProceedings{tiny_vit,
+ title={TinyViT: Fast Pretraining Distillation for Small Vision Transformers},
+ author={Wu, Kan and Zhang, Jinnian and Peng, Houwen and Liu, Mengchen and Xiao, Bin and Fu, Jianlong and Yuan, Lu},
+ booktitle={European conference on computer vision (ECCV)},
+ year={2022}
+}
+```
diff --git a/configs/tinyvit/metafile.yml b/configs/tinyvit/metafile.yml
new file mode 100644
index 0000000000000000000000000000000000000000..a1c5438acb9eba87f7a5e8c02356459c1194d74a
--- /dev/null
+++ b/configs/tinyvit/metafile.yml
@@ -0,0 +1,162 @@
+Collections:
+ - Name: TinyViT
+ Metadata:
+ Training Data: ImageNet-1k
+ Architecture:
+ - MBConv
+ - Window Multi-head Self-Attention
+ Paper:
+ Title: 'TinyViT: Fast Pretraining Distillation for Small Vision Transformers'
+ URL: https://arxiv.org/abs/2207.10666
+ README: configs/tinyvit/README.md
+ Code:
+ Version: v1.0.0rc1
+ URL: https://github.com/open-mmlab/mmpretrain/blob/v0.23.2/mmcls/models/backbones/tinyvit.py
+
+Models:
+ - Name: tinyvit-5m_3rdparty_in1k
+ Metadata:
+ FLOPs: 1286655360
+ Parameters: 5392764
+ Training Data: ImageNet-1k
+ In Collection: TinyViT
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 79.02
+ Top 5 Accuracy: 94.74
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/tinyvit/tinyvit-5m_3rdparty_in1k_20221021-62cb5abf.pth
+ Config: configs/tinyvit/tinyvit-5m_8xb256_in1k.py
+ Converted From:
+ Weights: https://github.com/wkcn/TinyViT-model-zoo/releases/download/checkpoints/tiny_vit_5m_1k.pth
+ Code: https://github.com/microsoft/Cream/tree/main/TinyViT
+ - Name: tinyvit-5m_in21k-distill-pre_3rdparty_in1k
+ Metadata:
+ FLOPs: 1286655360
+ Parameters: 5392764
+ Training Data:
+ - ImageNet-21k
+ - ImageNet-1k
+ In Collection: TinyViT
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 80.71
+ Top 5 Accuracy: 95.57
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/tinyvit/tinyvit-5m_in21k-distill-pre_3rdparty_in1k_20221021-d4b010a8.pth
+ Config: configs/tinyvit/tinyvit-5m-distill_8xb256_in1k.py
+ Converted From:
+ Weights: https://github.com/wkcn/TinyViT-model-zoo/releases/download/checkpoints/tiny_vit_5m_22kto1k_distill.pth
+ Code: https://github.com/microsoft/Cream/tree/main/TinyViT
+ - Name: tinyvit-11m_3rdparty_in1k
+ Metadata:
+ FLOPs: 2050033664
+ Parameters: 10996972
+ Training Data: ImageNet-1k
+ In Collection: TinyViT
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 81.44
+ Top 5 Accuracy: 95.79
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/tinyvit/tinyvit-11m_3rdparty_in1k_20221021-11ccef16.pth
+ Config: configs/tinyvit/tinyvit-11m_8xb256_in1k.py
+ Converted From:
+ Weights: https://github.com/wkcn/TinyViT-model-zoo/releases/download/checkpoints/tiny_vit_11m_1k.pth
+ Code: https://github.com/microsoft/Cream/tree/main/TinyViT
+ - Name: tinyvit-11m_in21k-distill-pre_3rdparty_in1k
+ Metadata:
+ FLOPs: 2050033664
+ Parameters: 10996972
+ Training Data:
+ - ImageNet-21k
+ - ImageNet-1k
+ In Collection: TinyViT
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 83.19
+ Top 5 Accuracy: 96.53
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/tinyvit/tinyvit-11m_in21k-distill-pre_3rdparty_in1k_20221021-5d3bc0dc.pth
+ Config: configs/tinyvit/tinyvit-11m-distill_8xb256_in1k.py
+ Converted From:
+ Weights: https://github.com/wkcn/TinyViT-model-zoo/releases/download/checkpoints/tiny_vit_11m_22kto1k_distill.pth
+ Code: https://github.com/microsoft/Cream/tree/main/TinyViT
+ - Name: tinyvit-21m_3rdparty_in1k
+ Metadata:
+ FLOPs: 4301124096
+ Parameters: 21198568
+ Training Data: ImageNet-1k
+ In Collection: TinyViT
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 83.08
+ Top 5 Accuracy: 96.58
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/tinyvit/tinyvit-21m_3rdparty_in1k_20221021-5346ba34.pth
+ Config: configs/tinyvit/tinyvit-21m_8xb256_in1k.py
+ Converted From:
+ Weights: https://github.com/wkcn/TinyViT-model-zoo/releases/download/checkpoints/tiny_vit_21m_1k.pth
+ Code: https://github.com/microsoft/Cream/tree/main/TinyViT
+ - Name: tinyvit-21m_in21k-distill-pre_3rdparty_in1k
+ Metadata:
+ FLOPs: 4301124096
+ Parameters: 21198568
+ Training Data:
+ - ImageNet-21k
+ - ImageNet-1k
+ In Collection: TinyViT
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 84.85
+ Top 5 Accuracy: 97.27
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/tinyvit/tinyvit-21m_in21k-distill-pre_3rdparty_in1k_20221021-3d9b30a2.pth
+ Config: configs/tinyvit/tinyvit-21m-distill_8xb256_in1k.py
+ Converted From:
+ Weights: https://github.com/wkcn/TinyViT-model-zoo/releases/download/checkpoints/tiny_vit_21m_22kto1k_distill.pth
+ Code: https://github.com/microsoft/Cream/tree/main/TinyViT
+ - Name: tinyvit-21m_in21k-distill-pre_3rdparty_in1k-384px
+ Metadata:
+ FLOPs: 13848250176
+ Parameters: 21230488
+ Training Data:
+ - ImageNet-21k
+ - ImageNet-1k
+ In Collection: TinyViT
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 86.21
+ Top 5 Accuracy: 97.77
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/tinyvit/tinyvit-21m_in21k-distill-pre_3rdparty_in1k-384px_20221021-65be6b3f.pth
+ Config: configs/tinyvit/tinyvit-21m-distill_8xb256_in1k-384px.py
+ Converted From:
+ Weights: https://github.com/wkcn/TinyViT-model-zoo/releases/download/checkpoints/tiny_vit_21m_22kto1k_384_distill.pth
+ Code: https://github.com/microsoft/Cream/tree/main/TinyViT
+ - Name: tinyvit-21m_in21k-distill-pre_3rdparty_in1k-512px
+ Metadata:
+ FLOPs: 27151420224
+ Parameters: 21268120
+ Training Data:
+ - ImageNet-21k
+ - ImageNet-1k
+ In Collection: TinyViT
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 86.44
+ Top 5 Accuracy: 97.89
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/tinyvit/tinyvit-21m_in21k-distill-pre_3rdparty_in1k-512px_20221021-e42a9bea.pth
+ Config: configs/tinyvit/tinyvit-21m-distill_8xb256_in1k-512px.py
+ Converted From:
+ Weights: https://github.com/wkcn/TinyViT-model-zoo/releases/download/checkpoints/tiny_vit_21m_22kto1k_512_distill.pth
+ Code: https://github.com/microsoft/Cream/tree/main/TinyViT
diff --git a/configs/tinyvit/tinyvit-11m-distill_8xb256_in1k.py b/configs/tinyvit/tinyvit-11m-distill_8xb256_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..145feb9aa65baf4bba947cdebb6e8dad5b9781f5
--- /dev/null
+++ b/configs/tinyvit/tinyvit-11m-distill_8xb256_in1k.py
@@ -0,0 +1,3 @@
+_base_ = [
+ './tinyvit-11m_8xb256_in1k.py',
+]
diff --git a/configs/tinyvit/tinyvit-11m_8xb256_in1k.py b/configs/tinyvit/tinyvit-11m_8xb256_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..f3acfa86a0d5fa24aae44c01064c49f5348d7da3
--- /dev/null
+++ b/configs/tinyvit/tinyvit-11m_8xb256_in1k.py
@@ -0,0 +1,6 @@
+_base_ = [
+ '../_base_/datasets/imagenet_bs32_pil_bicubic.py',
+ '../_base_/schedules/imagenet_bs1024_adamw_swin.py',
+ '../_base_/default_runtime.py',
+ '../_base_/models/tinyvit/tinyvit-11m.py',
+]
diff --git a/configs/tinyvit/tinyvit-21m-distill_8xb256_in1k-384px.py b/configs/tinyvit/tinyvit-21m-distill_8xb256_in1k-384px.py
new file mode 100644
index 0000000000000000000000000000000000000000..44e51b1930dd96c987dd4eab9dd77d0e068c801c
--- /dev/null
+++ b/configs/tinyvit/tinyvit-21m-distill_8xb256_in1k-384px.py
@@ -0,0 +1,29 @@
+_base_ = [
+ '../_base_/datasets/imagenet_bs32_pil_bicubic.py',
+ '../_base_/schedules/imagenet_bs1024_adamw_swin.py',
+ '../_base_/default_runtime.py',
+ '../_base_/models/tinyvit/tinyvit-21m.py',
+]
+
+# model settings
+model = dict(
+ backbone=dict(
+ img_size=(384, 384),
+ window_size=[12, 12, 24, 12],
+ drop_path_rate=0.1,
+ ))
+
+# data settings
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='Resize',
+ scale=(384, 384),
+ backend='pillow',
+ interpolation='bicubic'),
+ dict(type='PackInputs'),
+]
+
+val_dataloader = dict(dataset=dict(pipeline=test_pipeline))
+
+test_dataloader = val_dataloader
diff --git a/configs/tinyvit/tinyvit-21m-distill_8xb256_in1k-512px.py b/configs/tinyvit/tinyvit-21m-distill_8xb256_in1k-512px.py
new file mode 100644
index 0000000000000000000000000000000000000000..05b47c6de94868a6df6ec95cd406095dfc80153e
--- /dev/null
+++ b/configs/tinyvit/tinyvit-21m-distill_8xb256_in1k-512px.py
@@ -0,0 +1,28 @@
+_base_ = [
+ '../_base_/datasets/imagenet_bs32_pil_bicubic.py',
+ '../_base_/schedules/imagenet_bs1024_adamw_swin.py',
+ '../_base_/default_runtime.py',
+ '../_base_/models/tinyvit/tinyvit-21m.py',
+]
+
+# model settings
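+# override the backbone input size and attention window sizes for
+# 512x512 inputs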
+model = dict(
+ backbone=dict(
+ img_size=(512, 512),
+ window_size=[16, 16, 32, 16],
+ drop_path_rate=0.1,
+ ))
+
+# data settings
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='Resize',
+ scale=(512, 512),
+ backend='pillow',
+ interpolation='bicubic'),
+ dict(type='PackInputs'),
+]
+
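+# evaluate with a reduced batch size at the 512x512 resolution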
+val_dataloader = dict(batch_size=16, dataset=dict(pipeline=test_pipeline))
+
+test_dataloader = val_dataloader
diff --git a/configs/tinyvit/tinyvit-21m-distill_8xb256_in1k.py b/configs/tinyvit/tinyvit-21m-distill_8xb256_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..53885852757c6dce993addb6772b7d6e98219d81
--- /dev/null
+++ b/configs/tinyvit/tinyvit-21m-distill_8xb256_in1k.py
@@ -0,0 +1,3 @@
+_base_ = [
+ './tinyvit-21m_8xb256_in1k.py',
+]
diff --git a/configs/tinyvit/tinyvit-21m_8xb256_in1k.py b/configs/tinyvit/tinyvit-21m_8xb256_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..6c12019c9cf0babe49b24a21fa74fc66d33dda91
--- /dev/null
+++ b/configs/tinyvit/tinyvit-21m_8xb256_in1k.py
@@ -0,0 +1,6 @@
+_base_ = [
+ '../_base_/datasets/imagenet_bs32_pil_bicubic.py',
+ '../_base_/schedules/imagenet_bs1024_adamw_swin.py',
+ '../_base_/default_runtime.py',
+ '../_base_/models/tinyvit/tinyvit-21m.py',
+]
diff --git a/configs/tinyvit/tinyvit-5m-distill_8xb256_in1k.py b/configs/tinyvit/tinyvit-5m-distill_8xb256_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..0003c30ac46d2dbe2069733a17b039133b95ae8a
--- /dev/null
+++ b/configs/tinyvit/tinyvit-5m-distill_8xb256_in1k.py
@@ -0,0 +1,3 @@
+_base_ = [
+ './tinyvit-5m_8xb256_in1k.py',
+]
diff --git a/configs/tinyvit/tinyvit-5m_8xb256_in1k.py b/configs/tinyvit/tinyvit-5m_8xb256_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..262b5a469c4daa7ed135e466e872bb57e0f1f148
--- /dev/null
+++ b/configs/tinyvit/tinyvit-5m_8xb256_in1k.py
@@ -0,0 +1,6 @@
+_base_ = [
+ '../_base_/datasets/imagenet_bs32_pil_bicubic.py',
+ '../_base_/schedules/imagenet_bs1024_adamw_swin.py',
+ '../_base_/default_runtime.py',
+ '../_base_/models/tinyvit/tinyvit-5m.py',
+]
diff --git a/configs/tnt/README.md b/configs/tnt/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..e86da0b4a8d31a09b6f41e99cff4c233e67a114a
--- /dev/null
+++ b/configs/tnt/README.md
@@ -0,0 +1,77 @@
+# Transformer in Transformer
+
+> [Transformer in Transformer](https://arxiv.org/abs/2103.00112)
+
+
+
+## Abstract
+
+Transformer is a new kind of neural architecture which encodes the input data as powerful features via the attention mechanism. Basically, the visual transformers first divide the input images into several local patches and then calculate both representations and their relationship. Since natural images are of high complexity with abundant detail and color information, the granularity of the patch dividing is not fine enough for excavating features of objects in different scales and locations. In this paper, we point out that the attention inside these local patches are also essential for building visual transformers with high performance and we explore a new architecture, namely, Transformer iN Transformer (TNT). Specifically, we regard the local patches (e.g., 16×16) as "visual sentences" and present to further divide them into smaller patches (e.g., 4×4) as "visual words". The attention of each word will be calculated with other words in the given visual sentence with negligible computational costs. Features of both words and sentences will be aggregated to enhance the representation ability. Experiments on several benchmarks demonstrate the effectiveness of the proposed TNT architecture, e.g., we achieve an 81.5% top-1 accuracy on the ImageNet, which is about 1.7% higher than that of the state-of-the-art visual transformer with similar computational cost.
+
+
+

+
+
+## How to use it?
+
+
+
+**Predict image**
+
+```python
+from mmpretrain import inference_model
+
+predict = inference_model('tnt-small-p16_3rdparty_in1k', 'demo/bird.JPEG')
+print(predict['pred_class'])
+print(predict['pred_score'])
+```
+
+**Use the model**
+
+```python
+import torch
+from mmpretrain import get_model
+
+model = get_model('tnt-small-p16_3rdparty_in1k', pretrained=True)
+inputs = torch.rand(1, 3, 224, 224)
+out = model(inputs)
+print(type(out))
+# To extract features.
+feats = model.extract_feat(inputs)
+print(type(feats))
+```
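+
+**List available models**
+
+The checkpoint names used above can be discovered programmatically (a minimal sketch; it assumes the `list_models` helper exported by the top-level `mmpretrain` package):
+
+```python
+from mmpretrain import list_models
+
+# print the names of all registered TNT models
+print(list_models('tnt*'))
+```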
+
+**Test Command**
+
+Prepare your dataset according to the [docs](https://mmpretrain.readthedocs.io/en/latest/user_guides/dataset_prepare.html#prepare-dataset).
+
+Test:
+
+```shell
+python tools/test.py configs/tnt/tnt-s-p16_16xb64_in1k.py https://download.openmmlab.com/mmclassification/v0/tnt/tnt-small-p16_3rdparty_in1k_20210903-c56ee7df.pth
+```
+
+
+
+## Models and results
+
+### Image Classification on ImageNet-1k
+
+| Model | Pretrain | Params (M) | Flops (G) | Top-1 (%) | Top-5 (%) | Config | Download |
+| :------------------------------ | :----------: | :--------: | :-------: | :-------: | :-------: | :--------------------------------: | :------------------------------------------------------------------------------------: |
+| `tnt-small-p16_3rdparty_in1k`\* | From scratch | 23.76 | 3.36 | 81.52 | 95.73 | [config](tnt-s-p16_16xb64_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/tnt/tnt-small-p16_3rdparty_in1k_20210903-c56ee7df.pth) |
+
+*Models with * are converted from the [official repo](https://github.com/contrastive/pytorch-image-models/blob/809271b0f3e5d9be4e11c0c5cec1dbba8b5e2c60/timm/models/tnt.py#L144). The config files of these models are only for inference. We haven't reproduced the training results.*
+
+## Citation
+
+```bibtex
+@misc{han2021transformer,
+ title={Transformer in Transformer},
+ author={Kai Han and An Xiao and Enhua Wu and Jianyuan Guo and Chunjing Xu and Yunhe Wang},
+ year={2021},
+ eprint={2103.00112},
+ archivePrefix={arXiv},
+ primaryClass={cs.CV}
+}
+```
diff --git a/configs/tnt/metafile.yml b/configs/tnt/metafile.yml
new file mode 100644
index 0000000000000000000000000000000000000000..dcc2eddb5f479b987767802447cd46fa2a6383bb
--- /dev/null
+++ b/configs/tnt/metafile.yml
@@ -0,0 +1,29 @@
+Collections:
+ - Name: Transformer in Transformer
+ Metadata:
+ Training Data: ImageNet-1k
+ Paper:
+ URL: https://arxiv.org/abs/2103.00112
+ Title: "Transformer in Transformer"
+ README: configs/tnt/README.md
+ Code:
+ URL: https://github.com/open-mmlab/mmpretrain/blob/v0.15.0/mmcls/models/backbones/tnt.py#L203
+ Version: v0.15.0
+
+Models:
+ - Name: tnt-small-p16_3rdparty_in1k
+ Metadata:
+ FLOPs: 3360000000
+ Parameters: 23760000
+ In Collection: Transformer in Transformer
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 81.52
+ Top 5 Accuracy: 95.73
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/tnt/tnt-small-p16_3rdparty_in1k_20210903-c56ee7df.pth
+ Config: configs/tnt/tnt-s-p16_16xb64_in1k.py
+ Converted From:
+ Weights: https://github.com/contrastive/pytorch-image-models/releases/download/TNT/tnt_s_patch16_224.pth.tar
+ Code: https://github.com/contrastive/pytorch-image-models/blob/809271b0f3e5d9be4e11c0c5cec1dbba8b5e2c60/timm/models/tnt.py#L144
diff --git a/configs/tnt/tnt-s-p16_16xb64_in1k.py b/configs/tnt/tnt-s-p16_16xb64_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..af71232f831089a934d14beb4b187432661921ae
--- /dev/null
+++ b/configs/tnt/tnt-s-p16_16xb64_in1k.py
@@ -0,0 +1,56 @@
+# accuracy_top-1 : 81.52 accuracy_top-5 : 95.73
+_base_ = [
+ '../_base_/models/tnt_s_patch16_224.py',
+ '../_base_/datasets/imagenet_bs32_pil_resize.py',
+ '../_base_/default_runtime.py'
+]
+
+# dataset settings
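+# mean/std of 127.5 map pixel values from [0, 255] to roughly [-1, 1]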
+data_preprocessor = dict(
+ mean=[127.5, 127.5, 127.5],
+ std=[127.5, 127.5, 127.5],
+ # convert image from BGR to RGB
+ to_rgb=True,
+)
+
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='ResizeEdge',
+ scale=248,
+ edge='short',
+ backend='pillow',
+ interpolation='bicubic'),
+ dict(type='CenterCrop', crop_size=224),
+ dict(type='PackInputs'),
+]
+
+train_dataloader = dict(batch_size=64)
+val_dataloader = dict(dataset=dict(pipeline=test_pipeline))
+test_dataloader = dict(dataset=dict(pipeline=test_pipeline))
+
+# schedule settings
+optim_wrapper = dict(optimizer=dict(type='AdamW', lr=1e-3, weight_decay=0.05))
+
+param_scheduler = [
+ # warm up learning rate scheduler
+ dict(
+ type='LinearLR',
+ start_factor=1e-3,
+ by_epoch=True,
+ begin=0,
+ end=5,
+ # update by iter
+ convert_to_iter_based=True),
+ # main learning rate scheduler
+ dict(type='CosineAnnealingLR', T_max=295, by_epoch=True, begin=5, end=300)
+]
+
+train_cfg = dict(by_epoch=True, max_epochs=300, val_interval=1)
+val_cfg = dict()
+test_cfg = dict()
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR
+# based on the actual training batch size.
+# base_batch_size = (16 GPUs) x (64 samples per GPU)
+auto_scale_lr = dict(base_batch_size=1024)
diff --git a/configs/twins/README.md b/configs/twins/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..9e97b7842d9ddb8ab12d13283fb3ed50ed172f70
--- /dev/null
+++ b/configs/twins/README.md
@@ -0,0 +1,80 @@
+# Twins
+
+> [Twins: Revisiting the Design of Spatial Attention in Vision Transformers](http://arxiv-export-lb.library.cornell.edu/abs/2104.13840)
+
+
+
+## Abstract
+
+Very recently, a variety of vision transformer architectures for dense prediction tasks have been proposed and they show that the design of spatial attention is critical to their success in these tasks. In this work, we revisit the design of the spatial attention and demonstrate that a carefully-devised yet simple spatial attention mechanism performs favourably against the state-of-the-art schemes. As a result, we propose two vision transformer architectures, namely, Twins-PCPVT and Twins-SVT. Our proposed architectures are highly-efficient and easy to implement, only involving matrix multiplications that are highly optimized in modern deep learning frameworks. More importantly, the proposed architectures achieve excellent performance on a wide range of visual tasks, including image level classification as well as dense detection and segmentation. The simplicity and strong performance suggest that our proposed architectures may serve as stronger backbones for many vision tasks. Our code is released at [this https URL](https://github.com/Meituan-AutoML/Twins).
+
+
+

+
+
+## How to use it?
+
+
+
+**Predict image**
+
+```python
+from mmpretrain import inference_model
+
+predict = inference_model('twins-pcpvt-small_3rdparty_8xb128_in1k', 'demo/bird.JPEG')
+print(predict['pred_class'])
+print(predict['pred_score'])
+```
+
+**Use the model**
+
+```python
+import torch
+from mmpretrain import get_model
+
+model = get_model('twins-pcpvt-small_3rdparty_8xb128_in1k', pretrained=True)
+inputs = torch.rand(1, 3, 224, 224)
+out = model(inputs)
+print(type(out))
+# To extract features.
+feats = model.extract_feat(inputs)
+print(type(feats))
+```
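+
+**Batch inference**
+
+Several images can be classified in one call through the inferencer API (a sketch; it assumes the `ImageClassificationInferencer` class exported by `mmpretrain` and uses the bundled demo image as a placeholder input):
+
+```python
+from mmpretrain import ImageClassificationInferencer
+
+inferencer = ImageClassificationInferencer(
+    'twins-pcpvt-small_3rdparty_8xb128_in1k')
+# run inference on a list of images in mini-batches
+results = inferencer(['demo/bird.JPEG', 'demo/bird.JPEG'], batch_size=2)
+print(results[0]['pred_class'])
+```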
+
+**Test Command**
+
+Prepare your dataset according to the [docs](https://mmpretrain.readthedocs.io/en/latest/user_guides/dataset_prepare.html#prepare-dataset).
+
+Test:
+
+```shell
+python tools/test.py configs/twins/twins-pcpvt-small_8xb128_in1k.py https://download.openmmlab.com/mmclassification/v0/twins/twins-pcpvt-small_3rdparty_8xb128_in1k_20220126-ef23c132.pth
+```
+
+
+
+## Models and results
+
+### Image Classification on ImageNet-1k
+
+| Model | Pretrain | Params (M) | Flops (G) | Top-1 (%) | Top-5 (%) | Config | Download |
+| :----------------------------------------- | :----------: | :--------: | :-------: | :-------: | :-------: | :----------------------------------------: | :-----------------------------------------------------------------: |
+| `twins-pcpvt-small_3rdparty_8xb128_in1k`\* | From scratch | 24.11 | 3.67 | 81.14 | 95.69 | [config](twins-pcpvt-small_8xb128_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/twins/twins-pcpvt-small_3rdparty_8xb128_in1k_20220126-ef23c132.pth) |
+| `twins-pcpvt-base_3rdparty_8xb128_in1k`\* | From scratch | 43.83 | 6.45 | 82.66 | 96.26 | [config](twins-pcpvt-base_8xb128_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/twins/twins-pcpvt-base_3rdparty_8xb128_in1k_20220126-f8c4b0d5.pth) |
+| `twins-pcpvt-large_3rdparty_16xb64_in1k`\* | From scratch | 60.99 | 9.51 | 83.09 | 96.59 | [config](twins-pcpvt-large_16xb64_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/twins/twins-pcpvt-large_3rdparty_16xb64_in1k_20220126-c1ef8d80.pth) |
+| `twins-svt-small_3rdparty_8xb128_in1k`\* | From scratch | 24.06 | 2.82 | 81.77 | 95.57 | [config](twins-svt-small_8xb128_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/twins/twins-svt-small_3rdparty_8xb128_in1k_20220126-8fe5205b.pth) |
+| `twins-svt-base_8xb128_3rdparty_in1k`\* | From scratch | 56.07 | 8.35 | 83.13 | 96.29 | [config](twins-svt-base_8xb128_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/twins/twins-svt-base_3rdparty_8xb128_in1k_20220126-e31cc8e9.pth) |
+| `twins-svt-large_3rdparty_16xb64_in1k`\* | From scratch | 99.27 | 14.82 | 83.60 | 96.50 | [config](twins-svt-large_16xb64_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/twins/twins-svt-large_3rdparty_16xb64_in1k_20220126-4817645f.pth) |
+
+*Models with * are converted from [timm](https://github.com/rwightman/pytorch-image-models/blob/master/timm/models/twins.py). The config files of these models are only for inference. We haven't reproduced the training results.*
+
+## Citation
+
+```bibtex
+@article{chu2021twins,
+ title={Twins: Revisiting spatial attention design in vision transformers},
+ author={Chu, Xiangxiang and Tian, Zhi and Wang, Yuqing and Zhang, Bo and Ren, Haibing and Wei, Xiaolin and Xia, Huaxia and Shen, Chunhua},
+ journal={arXiv preprint arXiv:2104.13840},
+  year={2021}
+}
+```
diff --git a/configs/twins/metafile.yml b/configs/twins/metafile.yml
new file mode 100644
index 0000000000000000000000000000000000000000..d0d8ff4a324b86865b711b48d769a1f8fdb9130c
--- /dev/null
+++ b/configs/twins/metafile.yml
@@ -0,0 +1,114 @@
+Collections:
+ - Name: Twins
+ Metadata:
+ Training Data: ImageNet-1k
+ Architecture:
+ - Global Subsampled Attention
+      - Locally Grouped Self-Attention
+ - Conditional Position Encoding
+ - Pyramid Vision Transformer
+ Paper:
+ URL: http://arxiv-export-lb.library.cornell.edu/abs/2104.13840
+ Title: "Twins: Revisiting the Design of Spatial Attention in Vision Transformers"
+ README: configs/twins/README.md
+ Code:
+ URL: https://github.com/open-mmlab/mmpretrain/blob/v0.20.1/mmcls/models/backbones/twins.py
+ Version: v0.20.1
+
+Models:
+ - Name: twins-pcpvt-small_3rdparty_8xb128_in1k
+ Metadata:
+ FLOPs: 3670000000 # 3.67G
+ Parameters: 24110000 # 24.11M
+ In Collection: Twins
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 81.14
+ Top 5 Accuracy: 95.69
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/twins/twins-pcpvt-small_3rdparty_8xb128_in1k_20220126-ef23c132.pth
+ Config: configs/twins/twins-pcpvt-small_8xb128_in1k.py
+ Converted From:
+ Weights: https://github.com/rwightman/pytorch-image-models/releases/download/v0.1-vt3p-weights/twins_pcpvt_small-e70e7e7a.pth
+ Code: https://github.com/rwightman/pytorch-image-models/blob/master/timm/models/twins.py
+ - Name: twins-pcpvt-base_3rdparty_8xb128_in1k
+ Metadata:
+ FLOPs: 6450000000 # 6.45G
+ Parameters: 43830000 # 43.83M
+ In Collection: Twins
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 82.66
+ Top 5 Accuracy: 96.26
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/twins/twins-pcpvt-base_3rdparty_8xb128_in1k_20220126-f8c4b0d5.pth
+ Config: configs/twins/twins-pcpvt-base_8xb128_in1k.py
+ Converted From:
+ Weights: https://github.com/rwightman/pytorch-image-models/releases/download/v0.1-vt3p-weights/twins_pcpvt_small-e70e7e7a.pth
+ Code: https://github.com/rwightman/pytorch-image-models/blob/master/timm/models/twins.py
+ - Name: twins-pcpvt-large_3rdparty_16xb64_in1k
+ Metadata:
+ FLOPs: 9510000000 # 9.51G
+ Parameters: 60990000 # 60.99M
+ In Collection: Twins
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 83.09
+ Top 5 Accuracy: 96.59
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/twins/twins-pcpvt-large_3rdparty_16xb64_in1k_20220126-c1ef8d80.pth
+ Config: configs/twins/twins-pcpvt-large_16xb64_in1k.py
+ Converted From:
+ Weights: https://github.com/rwightman/pytorch-image-models/releases/download/v0.1-vt3p-weights/twins_pcpvt_small-e70e7e7a.pth
+ Code: https://github.com/rwightman/pytorch-image-models/blob/master/timm/models/twins.py
+ - Name: twins-svt-small_3rdparty_8xb128_in1k
+ Metadata:
+ FLOPs: 2820000000 # 2.82G
+ Parameters: 24060000 # 24.06M
+ In Collection: Twins
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 81.77
+ Top 5 Accuracy: 95.57
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/twins/twins-svt-small_3rdparty_8xb128_in1k_20220126-8fe5205b.pth
+ Config: configs/twins/twins-svt-small_8xb128_in1k.py
+ Converted From:
+ Weights: https://github.com/rwightman/pytorch-image-models/releases/download/v0.1-vt3p-weights/twins_pcpvt_small-e70e7e7a.pth
+ Code: https://github.com/rwightman/pytorch-image-models/blob/master/timm/models/twins.py
+ - Name: twins-svt-base_8xb128_3rdparty_in1k
+ Metadata:
+ FLOPs: 8350000000 # 8.35G
+ Parameters: 56070000 # 56.07M
+ In Collection: Twins
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 83.13
+ Top 5 Accuracy: 96.29
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/twins/twins-svt-base_3rdparty_8xb128_in1k_20220126-e31cc8e9.pth
+ Config: configs/twins/twins-svt-base_8xb128_in1k.py
+ Converted From:
+ Weights: https://github.com/rwightman/pytorch-image-models/releases/download/v0.1-vt3p-weights/twins_pcpvt_small-e70e7e7a.pth
+ Code: https://github.com/rwightman/pytorch-image-models/blob/master/timm/models/twins.py
+ - Name: twins-svt-large_3rdparty_16xb64_in1k
+ Metadata:
+ FLOPs: 14820000000 # 14.82G
+ Parameters: 99270000 # 99.27M
+ In Collection: Twins
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 83.60
+ Top 5 Accuracy: 96.50
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/twins/twins-svt-large_3rdparty_16xb64_in1k_20220126-4817645f.pth
+ Config: configs/twins/twins-svt-large_16xb64_in1k.py
+ Converted From:
+ Weights: https://github.com/rwightman/pytorch-image-models/releases/download/v0.1-vt3p-weights/twins_pcpvt_small-e70e7e7a.pth
+ Code: https://github.com/rwightman/pytorch-image-models/blob/master/timm/models/twins.py
diff --git a/configs/twins/twins-pcpvt-base_8xb128_in1k.py b/configs/twins/twins-pcpvt-base_8xb128_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..3ac5d2adf15e4c71af8cff09a59acaa9d863f9a7
--- /dev/null
+++ b/configs/twins/twins-pcpvt-base_8xb128_in1k.py
@@ -0,0 +1,41 @@
+_base_ = [
+ '../_base_/models/twins_pcpvt_base.py',
+ '../_base_/datasets/imagenet_bs64_swin_224.py',
+ '../_base_/schedules/imagenet_bs1024_adamw_swin.py',
+ '../_base_/default_runtime.py'
+]
+
+# dataset settings
+train_dataloader = dict(batch_size=128)
+
+# schedule settings
+optim_wrapper = dict(
+ optimizer=dict(
+ type='AdamW',
+        lr=5e-4 * 128 * 8 / 512,  # base lr 5e-4 at batch 512, scaled to 128x8
+ weight_decay=0.05,
+ eps=1e-8,
+ betas=(0.9, 0.999)),
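+    # discard the paramwise settings inherited from the base schedule and
+    # exempt normalization layers and biases from weight decay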
+    paramwise_cfg=dict(
+        _delete_=True, norm_decay_mult=0.0, bias_decay_mult=0.0),
+ clip_grad=dict(max_norm=5.0),
+)
+
+param_scheduler = [
+ # warm up learning rate scheduler
+ dict(
+ type='LinearLR',
+ start_factor=1e-3,
+ by_epoch=True,
+ begin=0,
+ end=5,
+ # update by iter
+ convert_to_iter_based=True),
+ # main learning rate scheduler
+ dict(
+ type='CosineAnnealingLR',
+ T_max=295,
+ eta_min=1e-5,
+ by_epoch=True,
+ begin=5,
+ end=300)
+]
diff --git a/configs/twins/twins-pcpvt-large_16xb64_in1k.py b/configs/twins/twins-pcpvt-large_16xb64_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..0acfd7528b5c17ece73586df3ce7dc850ea5a64a
--- /dev/null
+++ b/configs/twins/twins-pcpvt-large_16xb64_in1k.py
@@ -0,0 +1,7 @@
+_base_ = ['twins-pcpvt-base_8xb128_in1k.py']
+
+# model settings
+model = dict(backbone=dict(arch='large'), head=dict(in_channels=512))
+
+# dataset settings
+train_dataloader = dict(batch_size=64)
diff --git a/configs/twins/twins-pcpvt-small_8xb128_in1k.py b/configs/twins/twins-pcpvt-small_8xb128_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..9fe763b77754bf249030d48459302e532900a1a3
--- /dev/null
+++ b/configs/twins/twins-pcpvt-small_8xb128_in1k.py
@@ -0,0 +1,4 @@
+_base_ = ['twins-pcpvt-base_8xb128_in1k.py']
+
+# model settings
+model = dict(backbone=dict(arch='small'), head=dict(in_channels=512))
diff --git a/configs/twins/twins-svt-base_8xb128_in1k.py b/configs/twins/twins-svt-base_8xb128_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..1d24f63b074afe59574d04e40f8379ec6c386baa
--- /dev/null
+++ b/configs/twins/twins-svt-base_8xb128_in1k.py
@@ -0,0 +1,41 @@
+_base_ = [
+ '../_base_/models/twins_svt_base.py',
+ '../_base_/datasets/imagenet_bs64_swin_224.py',
+ '../_base_/schedules/imagenet_bs1024_adamw_swin.py',
+ '../_base_/default_runtime.py'
+]
+
+# dataset settings
+train_dataloader = dict(batch_size=128)
+
+# schedule settings
+optim_wrapper = dict(
+ optimizer=dict(
+ type='AdamW',
+        lr=5e-4 * 128 * 8 / 512,  # base lr 5e-4 at batch 512, scaled to 128x8
+ weight_decay=0.05,
+ eps=1e-8,
+ betas=(0.9, 0.999)),
+    paramwise_cfg=dict(
+        _delete_=True, norm_decay_mult=0.0, bias_decay_mult=0.0),
+ clip_grad=dict(max_norm=5.0),
+)
+
+param_scheduler = [
+ # warm up learning rate scheduler
+ dict(
+ type='LinearLR',
+ start_factor=1e-3,
+ by_epoch=True,
+ begin=0,
+ end=5,
+ # update by iter
+ convert_to_iter_based=True),
+ # main learning rate scheduler
+ dict(
+ type='CosineAnnealingLR',
+ T_max=295,
+ eta_min=1e-5,
+ by_epoch=True,
+ begin=5,
+ end=300)
+]
diff --git a/configs/twins/twins-svt-large_16xb64_in1k.py b/configs/twins/twins-svt-large_16xb64_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..e8a1eba894e5f831376ad7c5871434db438db59b
--- /dev/null
+++ b/configs/twins/twins-svt-large_16xb64_in1k.py
@@ -0,0 +1,7 @@
+_base_ = ['twins-svt-base_8xb128_in1k.py']
+
+# model settings
+model = dict(backbone=dict(arch='large'), head=dict(in_channels=1024))
+
+# dataset settings
+train_dataloader = dict(batch_size=64)
diff --git a/configs/twins/twins-svt-small_8xb128_in1k.py b/configs/twins/twins-svt-small_8xb128_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..2ffe267b56e921abcdcc40c833bba42e9952a4d4
--- /dev/null
+++ b/configs/twins/twins-svt-small_8xb128_in1k.py
@@ -0,0 +1,4 @@
+_base_ = ['twins-svt-base_8xb128_in1k.py']
+
+# model settings
+model = dict(backbone=dict(arch='small'), head=dict(in_channels=512))
diff --git a/configs/van/README.md b/configs/van/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..7e548b6b8003169602ea6a205c2c305b8808ed39
--- /dev/null
+++ b/configs/van/README.md
@@ -0,0 +1,78 @@
+# Visual-Attention-Network
+
+> [Visual Attention Network](https://arxiv.org/abs/2202.09741)
+
+
+
+## Abstract
+
+While originally designed for natural language processing (NLP) tasks, the self-attention mechanism has recently taken various computer vision areas by storm. However, the 2D nature of images brings three challenges for applying self-attention in computer vision. (1) Treating images as 1D sequences neglects their 2D structures. (2) The quadratic complexity is too expensive for high-resolution images. (3) It only captures spatial adaptability but ignores channel adaptability. In this paper, we propose a novel large kernel attention (LKA) module to enable self-adaptive and long-range correlations in self-attention while avoiding the above issues. We further introduce a novel neural network based on LKA, namely Visual Attention Network (VAN). While extremely simple and efficient, VAN outperforms the state-of-the-art vision transformers and convolutional neural networks with a large margin in extensive experiments, including image classification, object detection, semantic segmentation, instance segmentation, etc.
+
+
+

+
+
+## How to use it?
+
+
+
+**Predict image**
+
+```python
+from mmpretrain import inference_model
+
+predict = inference_model('van-tiny_3rdparty_in1k', 'demo/bird.JPEG')
+print(predict['pred_class'])
+print(predict['pred_score'])
+```
+
+**Use the model**
+
+```python
+import torch
+from mmpretrain import get_model
+
+model = get_model('van-tiny_3rdparty_in1k', pretrained=True)
+inputs = torch.rand(1, 3, 224, 224)
+out = model(inputs)
+print(type(out))
+# To extract features.
+feats = model.extract_feat(inputs)
+print(type(feats))
+```
+
+**Test Command**
+
+Prepare your dataset according to the [docs](https://mmpretrain.readthedocs.io/en/latest/user_guides/dataset_prepare.html#prepare-dataset).
+
+Test:
+
+```shell
+python tools/test.py configs/van/van-tiny_8xb128_in1k.py https://download.openmmlab.com/mmclassification/v0/van/van-tiny_8xb128_in1k_20220501-385941af.pth
+```
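+
+Multi-GPU evaluation can go through the distributed launcher instead (a sketch; it assumes the standard `tools/dist_test.sh` script and 8 available GPUs):
+
+```shell
+bash tools/dist_test.sh configs/van/van-tiny_8xb128_in1k.py https://download.openmmlab.com/mmclassification/v0/van/van-tiny_8xb128_in1k_20220501-385941af.pth 8
+```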
+
+
+
+## Models and results
+
+### Image Classification on ImageNet-1k
+
+| Model | Pretrain | Params (M) | Flops (G) | Top-1 (%) | Top-5 (%) | Config | Download |
+| :-------------------------- | :----------: | :--------: | :-------: | :-------: | :-------: | :--------------------------------: | :----------------------------------------------------------------------------------------: |
+| `van-tiny_3rdparty_in1k`\* | From scratch | 4.11 | 0.88 | 75.41 | 93.02 | [config](van-tiny_8xb128_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/van/van-tiny_8xb128_in1k_20220501-385941af.pth) |
+| `van-small_3rdparty_in1k`\* | From scratch | 13.86 | 2.52 | 81.01 | 95.63 | [config](van-small_8xb128_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/van/van-small_8xb128_in1k_20220501-17bc91aa.pth) |
+| `van-base_3rdparty_in1k`\* | From scratch | 26.58 | 5.03 | 82.80 | 96.21 | [config](van-base_8xb128_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/van/van-base_8xb128_in1k_20220501-6a4cc31b.pth) |
+| `van-large_3rdparty_in1k`\* | From scratch | 44.77 | 8.99 | 83.86 | 96.73 | [config](van-large_8xb128_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/van/van-large_8xb128_in1k_20220501-f212ba21.pth) |
+
+*Models with * are converted from the [official repo](https://github.com/Visual-Attention-Network/VAN-Classification). The config files of these models are only for inference. We haven't reproduced the training results.*
+
+## Citation
+
+```bibtex
+@article{guo2022visual,
+ title={Visual Attention Network},
+ author={Guo, Meng-Hao and Lu, Cheng-Ze and Liu, Zheng-Ning and Cheng, Ming-Ming and Hu, Shi-Min},
+ journal={arXiv preprint arXiv:2202.09741},
+ year={2022}
+}
+```
diff --git a/configs/van/metafile.yml b/configs/van/metafile.yml
new file mode 100644
index 0000000000000000000000000000000000000000..db5a6e6443c13a1eb9dc669923d8c0902e89ee7a
--- /dev/null
+++ b/configs/van/metafile.yml
@@ -0,0 +1,82 @@
+Collections:
+ - Name: Visual-Attention-Network
+ Metadata:
+ Training Data: ImageNet-1k
+ Training Techniques:
+ - AdamW
+ - Weight Decay
+ Architecture:
+ - Visual Attention Network
+ Paper:
+ URL: https://arxiv.org/abs/2202.09741
+ Title: "Visual Attention Network"
+ README: configs/van/README.md
+ Code:
+ URL: https://github.com/open-mmlab/mmpretrain/blob/v0.23.0/mmcls/models/backbones/van.py
+ Version: v0.23.0
+
+Models:
+ - Name: van-tiny_3rdparty_in1k
+ Metadata:
+ Parameters: 4110000 # 4.11M
+ FLOPs: 880000000 # 0.88G
+ In Collection: Visual-Attention-Network
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 75.41
+ Top 5 Accuracy: 93.02
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/van/van-tiny_8xb128_in1k_20220501-385941af.pth
+ Config: configs/van/van-tiny_8xb128_in1k.py
+ Converted From:
+ Code: https://github.com/Visual-Attention-Network/VAN-Classification
+ Weights: https://cloud.tsinghua.edu.cn/f/aada2242a16245d6a561/?dl=1
+ - Name: van-small_3rdparty_in1k
+ Metadata:
+ Parameters: 13860000 # 13.86M
+ FLOPs: 2520000000 # 2.52G
+ In Collection: Visual-Attention-Network
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 81.01
+ Top 5 Accuracy: 95.63
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/van/van-small_8xb128_in1k_20220501-17bc91aa.pth
+ Config: configs/van/van-small_8xb128_in1k.py
+ Converted From:
+ Code: https://github.com/Visual-Attention-Network/VAN-Classification
+ Weights: https://cloud.tsinghua.edu.cn/f/dd3eb73692f74a2499c9/?dl=1
+ - Name: van-base_3rdparty_in1k
+ Metadata:
+ Parameters: 26580000 # 26.58M
+ FLOPs: 5030000000 # 5.03G
+ In Collection: Visual-Attention-Network
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 82.80
+ Top 5 Accuracy: 96.21
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/van/van-base_8xb128_in1k_20220501-6a4cc31b.pth
+ Config: configs/van/van-base_8xb128_in1k.py
+ Converted From:
+ Code: https://github.com/Visual-Attention-Network/VAN-Classification
+ Weights: https://cloud.tsinghua.edu.cn/f/58e7acceaf334ecdba89/?dl=1
+ - Name: van-large_3rdparty_in1k
+ Metadata:
+ Parameters: 44770000 # 44.77 M
+ FLOPs: 8990000000 # 8.99G
+ In Collection: Visual-Attention-Network
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 83.86
+ Top 5 Accuracy: 96.73
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/van/van-large_8xb128_in1k_20220501-f212ba21.pth
+ Config: configs/van/van-large_8xb128_in1k.py
+ Converted From:
+ Code: https://github.com/Visual-Attention-Network/VAN-Classification
+ Weights: https://cloud.tsinghua.edu.cn/f/0201745f6920482490a0/?dl=1
diff --git a/configs/van/van-base_8xb128_in1k.py b/configs/van/van-base_8xb128_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..47082b748554eea9dfc467f63a5644294131fd14
--- /dev/null
+++ b/configs/van/van-base_8xb128_in1k.py
@@ -0,0 +1,65 @@
+_base_ = [
+ '../_base_/models/van/van_base.py',
+ '../_base_/datasets/imagenet_bs64_swin_224.py',
+ '../_base_/schedules/imagenet_bs1024_adamw_swin.py',
+ '../_base_/default_runtime.py',
+]
+
+# dataset setting
+data_preprocessor = dict(
+ mean=[127.5, 127.5, 127.5],
+ std=[127.5, 127.5, 127.5],
+ # convert image from BGR to RGB
+ to_rgb=True,
+)
+
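+# pipeline transforms run before the BGR-to-RGB conversion in the data
+# preprocessor, so their fill values are given in BGR channel order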
+bgr_mean = data_preprocessor['mean'][::-1]
+bgr_std = data_preprocessor['std'][::-1]
+
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='RandomResizedCrop',
+ scale=224,
+ backend='pillow',
+ interpolation='bicubic'),
+ dict(type='RandomFlip', prob=0.5, direction='horizontal'),
+ dict(
+ type='RandAugment',
+ policies='timm_increasing',
+ num_policies=2,
+ total_level=10,
+ magnitude_level=9,
+ magnitude_std=0.5,
+ hparams=dict(
+ pad_val=[round(x) for x in bgr_mean], interpolation='bicubic')),
+ dict(type='ColorJitter', brightness=0.4, contrast=0.4, saturation=0.4),
+ dict(
+ type='RandomErasing',
+ erase_prob=0.25,
+ mode='rand',
+ min_area_ratio=0.02,
+ max_area_ratio=1 / 3,
+ fill_color=bgr_mean,
+ fill_std=bgr_std),
+ dict(type='PackInputs'),
+]
+
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='ResizeEdge',
+ scale=248,
+ edge='short',
+ backend='pillow',
+ interpolation='bicubic'),
+ dict(type='CenterCrop', crop_size=224),
+ dict(type='PackInputs'),
+]
+
+train_dataloader = dict(dataset=dict(pipeline=train_pipeline), batch_size=128)
+val_dataloader = dict(dataset=dict(pipeline=test_pipeline))
+test_dataloader = dict(dataset=dict(pipeline=test_pipeline))
+
+# schedule settings
+optim_wrapper = dict(clip_grad=dict(max_norm=5.0))
diff --git a/configs/van/van-large_8xb128_in1k.py b/configs/van/van-large_8xb128_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..b16567726222306eff4a28ef76361922ecf28970
--- /dev/null
+++ b/configs/van/van-large_8xb128_in1k.py
@@ -0,0 +1,65 @@
+_base_ = [
+ '../_base_/models/van/van_large.py',
+ '../_base_/datasets/imagenet_bs64_swin_224.py',
+ '../_base_/schedules/imagenet_bs1024_adamw_swin.py',
+ '../_base_/default_runtime.py'
+]
+
+# dataset setting
+data_preprocessor = dict(
+ mean=[127.5, 127.5, 127.5],
+ std=[127.5, 127.5, 127.5],
+ # convert image from BGR to RGB
+ to_rgb=True,
+)
+
+bgr_mean = data_preprocessor['mean'][::-1]
+bgr_std = data_preprocessor['std'][::-1]
+
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='RandomResizedCrop',
+ scale=224,
+ backend='pillow',
+ interpolation='bicubic'),
+ dict(type='RandomFlip', prob=0.5, direction='horizontal'),
+ dict(
+ type='RandAugment',
+ policies='timm_increasing',
+ num_policies=2,
+ total_level=10,
+ magnitude_level=9,
+ magnitude_std=0.5,
+ hparams=dict(
+ pad_val=[round(x) for x in bgr_mean], interpolation='bicubic')),
+ dict(type='ColorJitter', brightness=0.4, contrast=0.4, saturation=0.4),
+ dict(
+ type='RandomErasing',
+ erase_prob=0.25,
+ mode='rand',
+ min_area_ratio=0.02,
+ max_area_ratio=1 / 3,
+ fill_color=bgr_mean,
+ fill_std=bgr_std),
+ dict(type='PackInputs'),
+]
+
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='ResizeEdge',
+ scale=248,
+ edge='short',
+ backend='pillow',
+ interpolation='bicubic'),
+ dict(type='CenterCrop', crop_size=224),
+ dict(type='PackInputs'),
+]
+
+train_dataloader = dict(dataset=dict(pipeline=train_pipeline), batch_size=128)
+val_dataloader = dict(dataset=dict(pipeline=test_pipeline))
+test_dataloader = dict(dataset=dict(pipeline=test_pipeline))
+
+# schedule settings
+optim_wrapper = dict(clip_grad=dict(max_norm=5.0))
diff --git a/configs/van/van-small_8xb128_in1k.py b/configs/van/van-small_8xb128_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..bbbbbdf4c8b7441a19c00c44f012478b1021335a
--- /dev/null
+++ b/configs/van/van-small_8xb128_in1k.py
@@ -0,0 +1,65 @@
+_base_ = [
+ '../_base_/models/van/van_small.py',
+ '../_base_/datasets/imagenet_bs64_swin_224.py',
+ '../_base_/schedules/imagenet_bs1024_adamw_swin.py',
+ '../_base_/default_runtime.py'
+]
+
+# dataset setting
+data_preprocessor = dict(
+ mean=[127.5, 127.5, 127.5],
+ std=[127.5, 127.5, 127.5],
+ # convert image from BGR to RGB
+ to_rgb=True,
+)
+
+bgr_mean = data_preprocessor['mean'][::-1]
+bgr_std = data_preprocessor['std'][::-1]
+
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='RandomResizedCrop',
+ scale=224,
+ backend='pillow',
+ interpolation='bicubic'),
+ dict(type='RandomFlip', prob=0.5, direction='horizontal'),
+ dict(
+ type='RandAugment',
+ policies='timm_increasing',
+ num_policies=2,
+ total_level=10,
+ magnitude_level=9,
+ magnitude_std=0.5,
+ hparams=dict(
+ pad_val=[round(x) for x in bgr_mean], interpolation='bicubic')),
+ dict(type='ColorJitter', brightness=0.4, contrast=0.4, saturation=0.4),
+ dict(
+ type='RandomErasing',
+ erase_prob=0.25,
+ mode='rand',
+ min_area_ratio=0.02,
+ max_area_ratio=1 / 3,
+ fill_color=bgr_mean,
+ fill_std=bgr_std),
+ dict(type='PackInputs'),
+]
+
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='ResizeEdge',
+ scale=248,
+ edge='short',
+ backend='pillow',
+ interpolation='bicubic'),
+ dict(type='CenterCrop', crop_size=224),
+ dict(type='PackInputs'),
+]
+
+train_dataloader = dict(dataset=dict(pipeline=train_pipeline), batch_size=128)
+val_dataloader = dict(dataset=dict(pipeline=test_pipeline))
+test_dataloader = dict(dataset=dict(pipeline=test_pipeline))
+
+# schedule settings
+optim_wrapper = dict(clip_grad=dict(max_norm=5.0))
diff --git a/configs/van/van-tiny_8xb128_in1k.py b/configs/van/van-tiny_8xb128_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..2ac62dab083c5c42dfd532f9191f01c74fcc9408
--- /dev/null
+++ b/configs/van/van-tiny_8xb128_in1k.py
@@ -0,0 +1,65 @@
+_base_ = [
+ '../_base_/models/van/van_tiny.py',
+ '../_base_/datasets/imagenet_bs64_swin_224.py',
+ '../_base_/schedules/imagenet_bs1024_adamw_swin.py',
+ '../_base_/default_runtime.py'
+]
+
+# dataset setting
+data_preprocessor = dict(
+ mean=[127.5, 127.5, 127.5],
+ std=[127.5, 127.5, 127.5],
+ # convert image from BGR to RGB
+ to_rgb=True,
+)
+
+bgr_mean = data_preprocessor['mean'][::-1]
+bgr_std = data_preprocessor['std'][::-1]
+
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='RandomResizedCrop',
+ scale=224,
+ backend='pillow',
+ interpolation='bicubic'),
+ dict(type='RandomFlip', prob=0.5, direction='horizontal'),
+ dict(
+ type='RandAugment',
+ policies='timm_increasing',
+ num_policies=2,
+ total_level=10,
+ magnitude_level=9,
+ magnitude_std=0.5,
+ hparams=dict(
+ pad_val=[round(x) for x in bgr_mean], interpolation='bicubic')),
+ dict(type='ColorJitter', brightness=0.4, contrast=0.4, saturation=0.4),
+ dict(
+ type='RandomErasing',
+ erase_prob=0.25,
+ mode='rand',
+ min_area_ratio=0.02,
+ max_area_ratio=1 / 3,
+ fill_color=bgr_mean,
+ fill_std=bgr_std),
+ dict(type='PackInputs'),
+]
+
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='ResizeEdge',
+ scale=248,
+ edge='short',
+ backend='pillow',
+ interpolation='bicubic'),
+ dict(type='CenterCrop', crop_size=224),
+ dict(type='PackInputs'),
+]
+
+train_dataloader = dict(dataset=dict(pipeline=train_pipeline), batch_size=128)
+val_dataloader = dict(dataset=dict(pipeline=test_pipeline))
+test_dataloader = dict(dataset=dict(pipeline=test_pipeline))
+
+# schedule settings
+optim_wrapper = dict(clip_grad=dict(max_norm=5.0))
diff --git a/configs/vgg/README.md b/configs/vgg/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..7af69ce6b87d1ce989881fa17bf5c6cacc3748be
--- /dev/null
+++ b/configs/vgg/README.md
@@ -0,0 +1,86 @@
+# VGG
+
+> [Very Deep Convolutional Networks for Large-Scale Image Recognition](https://arxiv.org/abs/1409.1556)
+
+
+
+## Abstract
+
+In this work we investigate the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting. Our main contribution is a thorough evaluation of networks of increasing depth using an architecture with very small (3x3) convolution filters, which shows that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16-19 weight layers. These findings were the basis of our ImageNet Challenge 2014 submission, where our team secured the first and the second places in the localisation and classification tracks respectively. We also show that our representations generalise well to other datasets, where they achieve state-of-the-art results. We have made our two best-performing ConvNet models publicly available to facilitate further research on the use of deep visual representations in computer vision.
+
+
+

+
+
+## How to use it?
+
+
+
+**Predict image**
+
+```python
+from mmpretrain import inference_model
+
+predict = inference_model('vgg11_8xb32_in1k', 'demo/bird.JPEG')
+print(predict['pred_class'])
+print(predict['pred_score'])
+```
+
+**Use the model**
+
+```python
+import torch
+from mmpretrain import get_model
+
+model = get_model('vgg11_8xb32_in1k', pretrained=True)
+inputs = torch.rand(1, 3, 224, 224)
+out = model(inputs)
+print(type(out))
+# To extract features.
+feats = model.extract_feat(inputs)
+print(type(feats))
+```
+
+**Train/Test Command**
+
+Prepare your dataset according to the [docs](https://mmpretrain.readthedocs.io/en/latest/user_guides/dataset_prepare.html#prepare-dataset).
+
+Train:
+
+```shell
+python tools/train.py configs/vgg/vgg11_8xb32_in1k.py
+```
+
+Test:
+
+```shell
+python tools/test.py configs/vgg/vgg11_8xb32_in1k.py https://download.openmmlab.com/mmclassification/v0/vgg/vgg11_batch256_imagenet_20210208-4271cd6c.pth
+```
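+
+Multi-GPU training that matches the `8xb32` setting in the config names can be launched with the distributed script (a sketch; it assumes the standard `tools/dist_train.sh` launcher and 8 GPUs on one node):
+
+```shell
+bash tools/dist_train.sh configs/vgg/vgg11_8xb32_in1k.py 8
+```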
+
+
+
+## Models and results
+
+### Image Classification on ImageNet-1k
+
+| Model | Pretrain | Params (M) | Flops (G) | Top-1 (%) | Top-5 (%) | Config | Download |
+| :------------------- | :----------: | :--------: | :-------: | :-------: | :-------: | :-----------------------------: | :--------------------------------------------------------------------------------------------------: |
+| `vgg11_8xb32_in1k` | From scratch | 132.86 | 7.63 | 68.75 | 88.87 | [config](vgg11_8xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/vgg/vgg11_batch256_imagenet_20210208-4271cd6c.pth) \| [log](https://download.openmmlab.com/mmclassification/v0/vgg/vgg11_batch256_imagenet_20210208-4271cd6c.json) |
+| `vgg13_8xb32_in1k` | From scratch | 133.05 | 11.34 | 70.02 | 89.46 | [config](vgg13_8xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/vgg/vgg13_batch256_imagenet_20210208-4d1d6080.pth) \| [log](https://download.openmmlab.com/mmclassification/v0/vgg/vgg13_batch256_imagenet_20210208-4d1d6080.json) |
+| `vgg16_8xb32_in1k` | From scratch | 138.36 | 15.50 | 71.62 | 90.49 | [config](vgg16_8xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/vgg/vgg16_batch256_imagenet_20210208-db26f1a5.pth) \| [log](https://download.openmmlab.com/mmclassification/v0/vgg/vgg16_batch256_imagenet_20210208-db26f1a5.json) |
+| `vgg19_8xb32_in1k` | From scratch | 143.67 | 19.67 | 72.41 | 90.80 | [config](vgg19_8xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/vgg/vgg19_batch256_imagenet_20210208-e6920e4a.pth) \| [log](https://download.openmmlab.com/mmclassification/v0/vgg/vgg19_batch256_imagenet_20210208-e6920e4a.json) |
+| `vgg11bn_8xb32_in1k` | From scratch | 132.87 | 7.64 | 70.67 | 90.16 | [config](vgg11bn_8xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/vgg/vgg11_bn_batch256_imagenet_20210207-f244902c.pth) \| [log](https://download.openmmlab.com/mmclassification/v0/vgg/vgg11_bn_batch256_imagenet_20210207-f244902c.json) |
+| `vgg13bn_8xb32_in1k` | From scratch | 133.05 | 11.36 | 72.12 | 90.66 | [config](vgg13bn_8xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/vgg/vgg13_bn_batch256_imagenet_20210207-1a8b7864.pth) \| [log](https://download.openmmlab.com/mmclassification/v0/vgg/vgg13_bn_batch256_imagenet_20210207-1a8b7864.json) |
+| `vgg16bn_8xb32_in1k` | From scratch | 138.37 | 15.53 | 73.74 | 91.66 | [config](vgg16bn_8xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/vgg/vgg16_bn_batch256_imagenet_20210208-7e55cd29.pth) \| [log](https://download.openmmlab.com/mmclassification/v0/vgg/vgg16_bn_batch256_imagenet_20210208-7e55cd29.json) |
+| `vgg19bn_8xb32_in1k` | From scratch | 143.68 | 19.70 | 74.68 | 92.27 | [config](vgg19bn_8xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/vgg/vgg19_bn_batch256_imagenet_20210208-da620c4f.pth) \| [log](https://download.openmmlab.com/mmclassification/v0/vgg/vgg19_bn_batch256_imagenet_20210208-da620c4f.json) |
+
+## Citation
+
+```bibtex
+@article{simonyan2014very,
+ title={Very deep convolutional networks for large-scale image recognition},
+ author={Simonyan, Karen and Zisserman, Andrew},
+ journal={arXiv preprint arXiv:1409.1556},
+ year={2014}
+}
+```
diff --git a/configs/vgg/metafile.yml b/configs/vgg/metafile.yml
new file mode 100644
index 0000000000000000000000000000000000000000..ce3af191a746878f7d9b6febf67cc6c96a5fa8c1
--- /dev/null
+++ b/configs/vgg/metafile.yml
@@ -0,0 +1,125 @@
+Collections:
+ - Name: VGG
+ Metadata:
+ Training Data: ImageNet-1k
+ Training Techniques:
+ - SGD with Momentum
+ - Weight Decay
+ Training Resources: 8x Xp GPUs
+ Epochs: 100
+ Batch Size: 256
+ Architecture:
+ - VGG
+ Paper:
+ URL: https://arxiv.org/abs/1409.1556
+ Title: "Very Deep Convolutional Networks for Large-Scale Image Recognition"
+ README: configs/vgg/README.md
+ Code:
+ URL: https://github.com/open-mmlab/mmpretrain/blob/v0.15.0/mmcls/models/backbones/vgg.py#L39
+ Version: v0.15.0
+
+Models:
+ - Name: vgg11_8xb32_in1k
+ Metadata:
+ FLOPs: 7630000000
+ Parameters: 132860000
+ In Collection: VGG
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 68.75
+ Top 5 Accuracy: 88.87
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/vgg/vgg11_batch256_imagenet_20210208-4271cd6c.pth
+ Config: configs/vgg/vgg11_8xb32_in1k.py
+ - Name: vgg13_8xb32_in1k
+ Metadata:
+ FLOPs: 11340000000
+ Parameters: 133050000
+ In Collection: VGG
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 70.02
+ Top 5 Accuracy: 89.46
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/vgg/vgg13_batch256_imagenet_20210208-4d1d6080.pth
+ Config: configs/vgg/vgg13_8xb32_in1k.py
+ - Name: vgg16_8xb32_in1k
+ Metadata:
+ FLOPs: 15500000000
+ Parameters: 138360000
+ In Collection: VGG
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 71.62
+ Top 5 Accuracy: 90.49
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/vgg/vgg16_batch256_imagenet_20210208-db26f1a5.pth
+ Config: configs/vgg/vgg16_8xb32_in1k.py
+ - Name: vgg19_8xb32_in1k
+ Metadata:
+ FLOPs: 19670000000
+ Parameters: 143670000
+ In Collection: VGG
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 72.41
+ Top 5 Accuracy: 90.8
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/vgg/vgg19_batch256_imagenet_20210208-e6920e4a.pth
+ Config: configs/vgg/vgg19_8xb32_in1k.py
+ - Name: vgg11bn_8xb32_in1k
+ Metadata:
+ FLOPs: 7640000000
+ Parameters: 132870000
+ In Collection: VGG
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 70.67
+ Top 5 Accuracy: 90.16
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/vgg/vgg11_bn_batch256_imagenet_20210207-f244902c.pth
+ Config: configs/vgg/vgg11bn_8xb32_in1k.py
+ - Name: vgg13bn_8xb32_in1k
+ Metadata:
+ FLOPs: 11360000000
+ Parameters: 133050000
+ In Collection: VGG
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 72.12
+ Top 5 Accuracy: 90.66
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/vgg/vgg13_bn_batch256_imagenet_20210207-1a8b7864.pth
+ Config: configs/vgg/vgg13bn_8xb32_in1k.py
+ - Name: vgg16bn_8xb32_in1k
+ Metadata:
+ FLOPs: 15530000000
+ Parameters: 138370000
+ In Collection: VGG
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 73.74
+ Top 5 Accuracy: 91.66
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/vgg/vgg16_bn_batch256_imagenet_20210208-7e55cd29.pth
+ Config: configs/vgg/vgg16bn_8xb32_in1k.py
+ - Name: vgg19bn_8xb32_in1k
+ Metadata:
+ FLOPs: 19700000000
+ Parameters: 143680000
+ In Collection: VGG
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 74.68
+ Top 5 Accuracy: 92.27
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/vgg/vgg19_bn_batch256_imagenet_20210208-da620c4f.pth
+ Config: configs/vgg/vgg19bn_8xb32_in1k.py
diff --git a/configs/vgg/vgg11_8xb32_in1k.py b/configs/vgg/vgg11_8xb32_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..616233c418fdeaa5d08db75b290f3438ec96b13c
--- /dev/null
+++ b/configs/vgg/vgg11_8xb32_in1k.py
@@ -0,0 +1,9 @@
+_base_ = [
+ '../_base_/models/vgg11.py',
+ '../_base_/datasets/imagenet_bs32_pil_resize.py',
+ '../_base_/schedules/imagenet_bs256.py',
+ '../_base_/default_runtime.py',
+]
+
+# schedule settings
+optim_wrapper = dict(optimizer=dict(lr=0.01))
diff --git a/configs/vgg/vgg11bn_8xb32_in1k.py b/configs/vgg/vgg11bn_8xb32_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..22f55ef0851ee4728caad271cfdaf02fb5c4afed
--- /dev/null
+++ b/configs/vgg/vgg11bn_8xb32_in1k.py
@@ -0,0 +1,6 @@
+_base_ = [
+ '../_base_/models/vgg11bn.py',
+ '../_base_/datasets/imagenet_bs32_pil_resize.py',
+ '../_base_/schedules/imagenet_bs256.py',
+ '../_base_/default_runtime.py',
+]
diff --git a/configs/vgg/vgg13_8xb32_in1k.py b/configs/vgg/vgg13_8xb32_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..ec1c98fb997568754868670a0f9d37233e6ca57d
--- /dev/null
+++ b/configs/vgg/vgg13_8xb32_in1k.py
@@ -0,0 +1,9 @@
+_base_ = [
+ '../_base_/models/vgg13.py',
+ '../_base_/datasets/imagenet_bs32_pil_resize.py',
+ '../_base_/schedules/imagenet_bs256.py',
+ '../_base_/default_runtime.py',
+]
+
+# schedule settings
+optim_wrapper = dict(optimizer=dict(lr=0.01))
diff --git a/configs/vgg/vgg13bn_8xb32_in1k.py b/configs/vgg/vgg13bn_8xb32_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..3cb3592b09e06e1b902c6d1fcca2cb03bcb7f82c
--- /dev/null
+++ b/configs/vgg/vgg13bn_8xb32_in1k.py
@@ -0,0 +1,6 @@
+_base_ = [
+ '../_base_/models/vgg13bn.py',
+ '../_base_/datasets/imagenet_bs32_pil_resize.py',
+ '../_base_/schedules/imagenet_bs256.py',
+ '../_base_/default_runtime.py',
+]
diff --git a/configs/vgg/vgg16_8xb16_voc.py b/configs/vgg/vgg16_8xb16_voc.py
new file mode 100644
index 0000000000000000000000000000000000000000..5d9e347bf533f36eb165dd06d0faf20ccbaba917
--- /dev/null
+++ b/configs/vgg/vgg16_8xb16_voc.py
@@ -0,0 +1,43 @@
+_base_ = [
+ '../_base_/datasets/voc_bs16.py',
+ '../_base_/default_runtime.py',
+]
+
+# model settings
+
+# initialize the backbone from an ImageNet-1k pre-trained VGG-16 checkpoint
+pretrained = 'https://download.openmmlab.com/mmclassification/v0/vgg/vgg16_batch256_imagenet_20210208-db26f1a5.pth' # noqa
+
+# use a multi-label head since each VOC image may contain multiple classes
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='VGG',
+ depth=16,
+ num_classes=20,
+ init_cfg=dict(
+ type='Pretrained', checkpoint=pretrained, prefix='backbone')),
+ neck=None,
+ head=dict(
+ type='MultiLabelClsHead',
+ loss=dict(type='CrossEntropyLoss', use_sigmoid=True, loss_weight=1.0)))
+
+# schedule settings
+optim_wrapper = dict(
+ optimizer=dict(type='SGD', lr=0.001, momentum=0.9, weight_decay=0),
+    # train the final classifier layer with a 10x learning rate.
+ paramwise_cfg=dict(custom_keys={'.backbone.classifier': dict(lr_mult=10)}),
+)
+
+# learning policy
+param_scheduler = dict(type='StepLR', by_epoch=True, step_size=20, gamma=0.1)
+
+# train, val, test setting
+train_cfg = dict(by_epoch=True, max_epochs=40, val_interval=1)
+val_cfg = dict()
+test_cfg = dict()
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR
+# based on the actual training batch size.
+# base_batch_size = (8 GPUs) x (16 samples per GPU)
+auto_scale_lr = dict(base_batch_size=128)
diff --git a/configs/vgg/vgg16_8xb32_in1k.py b/configs/vgg/vgg16_8xb32_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..a291da2813f011323f7ba19724dc92d87b935f80
--- /dev/null
+++ b/configs/vgg/vgg16_8xb32_in1k.py
@@ -0,0 +1,9 @@
+_base_ = [
+ '../_base_/models/vgg16.py',
+ '../_base_/datasets/imagenet_bs32_pil_resize.py',
+ '../_base_/schedules/imagenet_bs256.py',
+ '../_base_/default_runtime.py',
+]
+
+# schedule settings
+optim_wrapper = dict(optimizer=dict(lr=0.01))
diff --git a/configs/vgg/vgg16bn_8xb32_in1k.py b/configs/vgg/vgg16bn_8xb32_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..f6bbb81b86b279bbf84d7b877ef3bc370dedbf4e
--- /dev/null
+++ b/configs/vgg/vgg16bn_8xb32_in1k.py
@@ -0,0 +1,6 @@
+_base_ = [
+ '../_base_/models/vgg16bn.py',
+ '../_base_/datasets/imagenet_bs32_pil_resize.py',
+ '../_base_/schedules/imagenet_bs256.py',
+ '../_base_/default_runtime.py',
+]
diff --git a/configs/vgg/vgg19_8xb32_in1k.py b/configs/vgg/vgg19_8xb32_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..88cd24c1dd9cb28dc3c91e4403b241c441dfbe03
--- /dev/null
+++ b/configs/vgg/vgg19_8xb32_in1k.py
@@ -0,0 +1,9 @@
+_base_ = [
+ '../_base_/models/vgg19.py',
+ '../_base_/datasets/imagenet_bs32_pil_resize.py',
+ '../_base_/schedules/imagenet_bs256.py',
+ '../_base_/default_runtime.py',
+]
+
+# schedule settings
+optim_wrapper = dict(optimizer=dict(lr=0.01))
diff --git a/configs/vgg/vgg19bn_8xb32_in1k.py b/configs/vgg/vgg19bn_8xb32_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..4b4f34aba0ad5f665b86a8173af9e4436546af23
--- /dev/null
+++ b/configs/vgg/vgg19bn_8xb32_in1k.py
@@ -0,0 +1,6 @@
+_base_ = [
+ '../_base_/models/vgg19bn.py',
+ '../_base_/datasets/imagenet_bs32_pil_resize.py',
+ '../_base_/schedules/imagenet_bs256.py',
+ '../_base_/default_runtime.py',
+]
diff --git a/configs/vig/README.md b/configs/vig/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..624e387ac3799f599cbd886e9053cfa1d2a2de95
--- /dev/null
+++ b/configs/vig/README.md
@@ -0,0 +1,81 @@
+# VIG
+
+> [Vision GNN: An Image is Worth Graph of Nodes](https://arxiv.org/abs/2206.00272)
+
+
+
+## Abstract
+
+Network architecture plays a key role in the deep learning-based computer vision system. The widely-used convolutional neural network and transformer treat the image as a grid or sequence structure, which is not flexible to capture irregular and complex objects. In this paper, we propose to represent the image as a graph structure and introduce a new Vision GNN (ViG) architecture to extract graph-level feature for visual tasks. We first split the image to a number of patches which are viewed as nodes, and construct a graph by connecting the nearest neighbors. Based on the graph representation of images, we build our ViG model to transform and exchange information among all the nodes. ViG consists of two basic modules: Grapher module with graph convolution for aggregating and updating graph information, and FFN module with two linear layers for node feature transformation. Both isotropic and pyramid architectures of ViG are built with different model sizes. Extensive experiments on image recognition and object detection tasks demonstrate the superiority of our ViG architecture. We hope this pioneering study of GNN on general visual tasks will provide useful inspiration and experience for future research.
+
+
+

+
+
+## How to use it?
+
+
+
+**Predict image**
+
+```python
+from mmpretrain import inference_model
+
+predict = inference_model('vig-tiny_3rdparty_in1k', 'demo/bird.JPEG')
+print(predict['pred_class'])
+print(predict['pred_score'])
+```
+
+**Use the model**
+
+```python
+import torch
+from mmpretrain import get_model
+
+model = get_model('vig-tiny_3rdparty_in1k', pretrained=True)
+inputs = torch.rand(1, 3, 224, 224)
+out = model(inputs)
+print(type(out))
+# To extract features.
+feats = model.extract_feat(inputs)
+print(type(feats))
+```
+
+**Test Command**
+
+Prepare your dataset according to the [docs](https://mmpretrain.readthedocs.io/en/latest/user_guides/dataset_prepare.html#prepare-dataset).
+
+Test:
+
+```shell
+python tools/test.py configs/vig/vig-tiny_8xb128_in1k.py https://download.openmmlab.com/mmclassification/v0/vig/vig-tiny_3rdparty_in1k_20230117-6414c684.pth
+```
+
+
+
+## Models and results
+
+### Image Classification on ImageNet-1k
+
+| Model | Pretrain | Params (M) | Flops (G) | Top-1 (%) | Top-5 (%) | Config | Download |
+| :---------------------------- | :----------: | :--------: | :-------: | :-------: | :-------: | :----------------------------------: | :------------------------------------------------------------------------------------: |
+| `vig-tiny_3rdparty_in1k`\* | From scratch | 7.18 | 1.31 | 74.40 | 92.34 | [config](vig-tiny_8xb128_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/vig/vig-tiny_3rdparty_in1k_20230117-6414c684.pth) |
+| `vig-small_3rdparty_in1k`\* | From scratch | 22.75 | 4.54 | 80.61 | 95.28 | [config](vig-small_8xb128_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/vig/vig-small_3rdparty_in1k_20230117-5338bf3b.pth) |
+| `vig-base_3rdparty_in1k`\* | From scratch | 20.68 | 17.68 | 82.62 | 96.04 | [config](vig-base_8xb128_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/vig/vig-base_3rdparty_in1k_20230117-92f6f12f.pth) |
+| `pvig-tiny_3rdparty_in1k`\* | From scratch | 9.46 | 1.71 | 78.38 | 94.38 | [config](pvig-tiny_8xb128_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/vig/pvig-tiny_3rdparty_in1k_20230117-eb77347d.pth) |
+| `pvig-small_3rdparty_in1k`\* | From scratch | 29.02 | 4.57 | 82.00 | 95.97 | [config](pvig-small_8xb128_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/vig/pvig-small_3rdparty_in1k_20230117-9433dc96.pth) |
+| `pvig-medium_3rdparty_in1k`\* | From scratch | 51.68 | 8.89 | 83.12 | 96.35 | [config](pvig-medium_8xb128_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/vig/pvig-medium_3rdparty_in1k_20230117-21057a6d.pth) |
+| `pvig-base_3rdparty_in1k`\* | From scratch | 95.21 | 16.86 | 83.59 | 96.52 | [config](pvig-base_8xb128_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/vig/pvig-base_3rdparty_in1k_20230117-dbab3c85.pth) |
+
+*Models with * are converted from the [official repo](https://github.com/huawei-noah/Efficient-AI-Backbones/tree/master/vig_pytorch). The config files of these models are only for inference. We haven't reproduced the training results.*
+
+## Citation
+
+```bibtex
+@inproceedings{han2022vig,
+ title={Vision GNN: An Image is Worth Graph of Nodes},
+ author={Kai Han and Yunhe Wang and Jianyuan Guo and Yehui Tang and Enhua Wu},
+ booktitle={NeurIPS},
+ year={2022}
+}
+```
diff --git a/configs/vig/metafile.yml b/configs/vig/metafile.yml
new file mode 100644
index 0000000000000000000000000000000000000000..52bd18baf1623bf1f12a95d93c331749847a1339
--- /dev/null
+++ b/configs/vig/metafile.yml
@@ -0,0 +1,134 @@
+Collections:
+ - Name: VIG
+ Metadata:
+ Training Data: ImageNet-1k
+ Architecture:
+ - Vision GNN
+ Paper:
+ Title: 'Vision GNN: An Image is Worth Graph of Nodes'
+ URL: https://arxiv.org/abs/2206.00272
+ README: configs/vig/README.md
+ Code:
+ URL: null
+ Version: null
+
+Models:
+ - Name: vig-tiny_3rdparty_in1k
+ Metadata:
+ FLOPs: 1309000000
+ Parameters: 7185000
+ Training Data: ImageNet-1k
+ In Collection: VIG
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 74.40
+ Top 5 Accuracy: 92.34
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/vig/vig-tiny_3rdparty_in1k_20230117-6414c684.pth
+ Config: configs/vig/vig-tiny_8xb128_in1k.py
+ Converted From:
+ Weights: https://github.com/huawei-noah/Efficient-AI-Backbones/releases/download/vig/vig_ti_74.5.pth
+ Code: https://github.com/huawei-noah/Efficient-AI-Backbones/tree/master/vig_pytorch
+ - Name: vig-small_3rdparty_in1k
+ Metadata:
+ FLOPs: 4535000000
+ Parameters: 22748000
+ Training Data: ImageNet-1k
+ In Collection: VIG
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 80.61
+ Top 5 Accuracy: 95.28
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/vig/vig-small_3rdparty_in1k_20230117-5338bf3b.pth
+ Config: configs/vig/vig-small_8xb128_in1k.py
+ Converted From:
+ Weights: https://github.com/huawei-noah/Efficient-AI-Backbones/releases/download/vig/vig_s_80.6.pth
+ Code: https://github.com/huawei-noah/Efficient-AI-Backbones/tree/master/vig_pytorch
+ - Name: vig-base_3rdparty_in1k
+ Metadata:
+ FLOPs: 17681000000
+ Parameters: 20685000
+ Training Data: ImageNet-1k
+ In Collection: VIG
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 82.62
+ Top 5 Accuracy: 96.04
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/vig/vig-base_3rdparty_in1k_20230117-92f6f12f.pth
+ Config: configs/vig/vig-base_8xb128_in1k.py
+ Converted From:
+ Weights: https://github.com/huawei-noah/Efficient-AI-Backbones/releases/download/vig/vig_b_82.6.pth
+ Code: https://github.com/huawei-noah/Efficient-AI-Backbones/tree/master/vig_pytorch
+ - Name: pvig-tiny_3rdparty_in1k
+ Metadata:
+ FLOPs: 1714000000
+ Parameters: 9458000
+ Training Data: ImageNet-1k
+ In Collection: VIG
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 78.38
+ Top 5 Accuracy: 94.38
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/vig/pvig-tiny_3rdparty_in1k_20230117-eb77347d.pth
+ Config: configs/vig/pvig-tiny_8xb128_in1k.py
+ Converted From:
+ Weights: https://github.com/huawei-noah/Efficient-AI-Backbones/releases/download/pyramid-vig/pvig_ti_78.5.pth.tar
+ Code: https://github.com/huawei-noah/Efficient-AI-Backbones/tree/master/vig_pytorch
+ - Name: pvig-small_3rdparty_in1k
+ Metadata:
+ FLOPs: 4572000000
+ Parameters: 29024000
+ Training Data: ImageNet-1k
+ In Collection: VIG
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 82.00
+ Top 5 Accuracy: 95.97
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/vig/pvig-small_3rdparty_in1k_20230117-9433dc96.pth
+ Config: configs/vig/pvig-small_8xb128_in1k.py
+ Converted From:
+ Weights: https://github.com/huawei-noah/Efficient-AI-Backbones/releases/download/pyramid-vig/pvig_s_82.1.pth.tar
+ Code: https://github.com/huawei-noah/Efficient-AI-Backbones/tree/master/vig_pytorch
+ - Name: pvig-medium_3rdparty_in1k
+ Metadata:
+ FLOPs: 8886000000
+ Parameters: 51682000
+ Training Data: ImageNet-1k
+ In Collection: VIG
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 83.12
+ Top 5 Accuracy: 96.35
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/vig/pvig-medium_3rdparty_in1k_20230117-21057a6d.pth
+ Config: configs/vig/pvig-medium_8xb128_in1k.py
+ Converted From:
+ Weights: https://github.com/huawei-noah/Efficient-AI-Backbones/releases/download/pyramid-vig/pvig_m_83.1.pth.tar
+ Code: https://github.com/huawei-noah/Efficient-AI-Backbones/tree/master/vig_pytorch
+ - Name: pvig-base_3rdparty_in1k
+ Metadata:
+ FLOPs: 16861000000
+ Parameters: 95213000
+ Training Data: ImageNet-1k
+ In Collection: VIG
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 83.59
+ Top 5 Accuracy: 96.52
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/vig/pvig-base_3rdparty_in1k_20230117-dbab3c85.pth
+ Config: configs/vig/pvig-base_8xb128_in1k.py
+ Converted From:
+ Weights: https://github.com/huawei-noah/Efficient-AI-Backbones/releases/download/pyramid-vig/pvig_b_83.66.pth.tar
+ Code: https://github.com/huawei-noah/Efficient-AI-Backbones/tree/master/vig_pytorch
diff --git a/configs/vig/pvig-base_8xb128_in1k.py b/configs/vig/pvig-base_8xb128_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..1d66359c6c78068e48e0466fede86f11e14e9a91
--- /dev/null
+++ b/configs/vig/pvig-base_8xb128_in1k.py
@@ -0,0 +1,22 @@
+_base_ = [
+ '../_base_/models/vig/pyramid_vig_base.py',
+ '../_base_/datasets/imagenet_bs128_vig_224.py',
+ '../_base_/schedules/imagenet_bs256.py',
+ '../_base_/default_runtime.py',
+]
+
+# dataset settings
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='ResizeEdge',
+ scale=235,
+ edge='short',
+ backend='pillow',
+ interpolation='bicubic'),
+ dict(type='CenterCrop', crop_size=224),
+ dict(type='PackInputs'),
+]
+
+val_dataloader = dict(dataset=dict(pipeline=test_pipeline))
+test_dataloader = dict(dataset=dict(pipeline=test_pipeline))
diff --git a/configs/vig/pvig-medium_8xb128_in1k.py b/configs/vig/pvig-medium_8xb128_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..75c25a2d89b0b8fce8d816d0129afeaf63d6a5e2
--- /dev/null
+++ b/configs/vig/pvig-medium_8xb128_in1k.py
@@ -0,0 +1,6 @@
+_base_ = [
+ '../_base_/models/vig/pyramid_vig_medium.py',
+ '../_base_/datasets/imagenet_bs128_vig_224.py',
+ '../_base_/schedules/imagenet_bs256.py',
+ '../_base_/default_runtime.py',
+]
diff --git a/configs/vig/pvig-small_8xb128_in1k.py b/configs/vig/pvig-small_8xb128_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..755b3319d313f02ce9f1c2f2a943ddd934f7e49b
--- /dev/null
+++ b/configs/vig/pvig-small_8xb128_in1k.py
@@ -0,0 +1,6 @@
+_base_ = [
+ '../_base_/models/vig/pyramid_vig_small.py',
+ '../_base_/datasets/imagenet_bs128_vig_224.py',
+ '../_base_/schedules/imagenet_bs256.py',
+ '../_base_/default_runtime.py',
+]
diff --git a/configs/vig/pvig-tiny_8xb128_in1k.py b/configs/vig/pvig-tiny_8xb128_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..7a885559c597962201bed20249f8b688589a7788
--- /dev/null
+++ b/configs/vig/pvig-tiny_8xb128_in1k.py
@@ -0,0 +1,6 @@
+_base_ = [
+ '../_base_/models/vig/pyramid_vig_tiny.py',
+ '../_base_/datasets/imagenet_bs128_vig_224.py',
+ '../_base_/schedules/imagenet_bs256.py',
+ '../_base_/default_runtime.py',
+]
diff --git a/configs/vig/vig-base_8xb128_in1k.py b/configs/vig/vig-base_8xb128_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..cb8b55e3e841659f65e975947a9859361e34aa28
--- /dev/null
+++ b/configs/vig/vig-base_8xb128_in1k.py
@@ -0,0 +1,6 @@
+_base_ = [
+ '../_base_/models/vig/vig_base.py',
+ '../_base_/datasets/imagenet_bs128_vig_224.py',
+ '../_base_/schedules/imagenet_bs256.py',
+ '../_base_/default_runtime.py',
+]
diff --git a/configs/vig/vig-small_8xb128_in1k.py b/configs/vig/vig-small_8xb128_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..41508b2894d0849cfc92dd2340c71bebdf06f591
--- /dev/null
+++ b/configs/vig/vig-small_8xb128_in1k.py
@@ -0,0 +1,6 @@
+_base_ = [
+ '../_base_/models/vig/vig_small.py',
+ '../_base_/datasets/imagenet_bs128_vig_224.py',
+ '../_base_/schedules/imagenet_bs256.py',
+ '../_base_/default_runtime.py',
+]
diff --git a/configs/vig/vig-tiny_8xb128_in1k.py b/configs/vig/vig-tiny_8xb128_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..80b1693ad5baecd57d450ae33806e80ddce0f55e
--- /dev/null
+++ b/configs/vig/vig-tiny_8xb128_in1k.py
@@ -0,0 +1,6 @@
+_base_ = [
+ '../_base_/models/vig/vig_tiny.py',
+ '../_base_/datasets/imagenet_bs128_vig_224.py',
+ '../_base_/schedules/imagenet_bs256.py',
+ '../_base_/default_runtime.py',
+]
diff --git a/configs/vision_transformer/README.md b/configs/vision_transformer/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..66bd3f529dd85062323c585b38660ab414362250
--- /dev/null
+++ b/configs/vision_transformer/README.md
@@ -0,0 +1,101 @@
+# Vision Transformer
+
+> [An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale](https://arxiv.org/abs/2010.11929)
+
+
+
+## Introduction
+
+**Vision Transformer**, known as **ViT**, succeeded in using a full transformer to outperform previous works based on convolutional networks in the vision field. ViT splits an image into patches to feed the multi-head attention, concatenates a learnable class token for the final prediction, and adds learnable position embeddings to encode the relative positions between patches. Based on these three attention-related techniques, ViT provides a brand-new pattern for building basic architectures in the vision field.
+
+The strategy works even better when coupled with pre-training on large datasets. Thanks to its simplicity and effectiveness, many follow-up works in the classification field originate from ViT, and ViT-based methods still play an important role in the recent multi-modality field.
+
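+
+As a rough illustration of the three ingredients mentioned above (patch embedding, class token and position embedding), here is a minimal sketch in plain PyTorch. The tensor sizes follow the ViT-Base/16 setting; the snippet is illustrative only and is not the MMPreTrain implementation.
+
+```python
+import torch
+import torch.nn as nn
+
+img = torch.rand(1, 3, 224, 224)  # a single input image
+patch_embed = nn.Conv2d(3, 768, kernel_size=16, stride=16)  # 16x16 patches -> tokens
+cls_token = nn.Parameter(torch.zeros(1, 1, 768))  # learnable class token
+pos_embed = nn.Parameter(torch.zeros(1, 197, 768))  # 196 patch tokens + 1 class token
+
+tokens = patch_embed(img).flatten(2).transpose(1, 2)  # (1, 196, 768)
+tokens = torch.cat([cls_token, tokens], dim=1)  # prepend the class token
+tokens = tokens + pos_embed  # add position embeddings
+print(tokens.shape)  # torch.Size([1, 197, 768]), fed into the transformer encoder
+```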
+
+
+
+
+## Abstract
+
+
+
+
+
+
+While the Transformer architecture has become the de-facto standard for natural language processing tasks, its applications to computer vision remain limited. In vision, attention is either applied in conjunction with convolutional networks, or used to replace certain components of convolutional networks while keeping their overall structure in place. We show that this reliance on CNNs is not necessary and a pure transformer applied directly to sequences of image patches can perform very well on image classification tasks. When pre-trained on large amounts of data and transferred to multiple mid-sized or small image recognition benchmarks (ImageNet, CIFAR-100, VTAB, etc.), Vision Transformer (ViT) attains excellent results compared to state-of-the-art convolutional networks while requiring substantially fewer computational resources to train.
+
+
+
+
+## How to use it?
+
+
+
+**Predict image**
+
+```python
+from mmpretrain import inference_model
+
+predict = inference_model('vit-base-p32_in21k-pre_3rdparty_in1k-384px', 'demo/bird.JPEG')
+print(predict['pred_class'])
+print(predict['pred_score'])
+```
+
+**Use the model**
+
+```python
+import torch
+from mmpretrain import get_model
+
+model = get_model('vit-base-p32_in21k-pre_3rdparty_in1k-384px', pretrained=True)
+inputs = torch.rand(1, 3, 224, 224)
+out = model(inputs)
+print(type(out))
+# To extract features.
+feats = model.extract_feat(inputs)
+print(type(feats))
+```
+
+**Train/Test Command**
+
+Prepare your dataset according to the [docs](https://mmpretrain.readthedocs.io/en/latest/user_guides/dataset_prepare.html#prepare-dataset).
+
+Train:
+
+```shell
+python tools/train.py configs/vision_transformer/vit-base-p16_32xb128-mae_in1k.py
+```
+
+Test:
+
+```shell
+python tools/test.py configs/vision_transformer/vit-base-p32_64xb64_in1k-384px.py https://download.openmmlab.com/mmclassification/v0/vit/finetune/vit-base-p32_in21k-pre-3rdparty_ft-64xb64_in1k-384_20210928-9cea8599.pth
+```
+
+
+
+## Models and results
+
+### Image Classification on ImageNet-1k
+
+| Model | Pretrain | Params (M) | Flops (G) | Top-1 (%) | Top-5 (%) | Config | Download |
+| :---------------------------------------------- | :----------: | :--------: | :-------: | :-------: | :-------: | :------------------------------------------: | :----------------------------------------------------------: |
+| `vit-base-p32_in21k-pre_3rdparty_in1k-384px`\* | ImageNet-21k | 88.30 | 13.06 | 84.01 | 97.08 | [config](vit-base-p32_64xb64_in1k-384px.py) | [model](https://download.openmmlab.com/mmclassification/v0/vit/finetune/vit-base-p32_in21k-pre-3rdparty_ft-64xb64_in1k-384_20210928-9cea8599.pth) |
+| `vit-base-p16_32xb128-mae_in1k` | From scratch | 86.57 | 17.58 | 82.37 | 96.15 | [config](vit-base-p16_32xb128-mae_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/vit/vit-base-p16_pt-32xb128-mae_in1k_20220623-4c544545.pth) \| [log](https://download.openmmlab.com/mmclassification/v0/vit/vit-base-p16_pt-32xb128-mae_in1k_20220623-4c544545.log) |
+| `vit-base-p16_in21k-pre_3rdparty_in1k-384px`\* | ImageNet-21k | 86.86 | 55.54 | 85.43 | 97.77 | [config](vit-base-p16_64xb64_in1k-384px.py) | [model](https://download.openmmlab.com/mmclassification/v0/vit/finetune/vit-base-p16_in21k-pre-3rdparty_ft-64xb64_in1k-384_20210928-98e8652b.pth) |
+| `vit-large-p16_in21k-pre_3rdparty_in1k-384px`\* | ImageNet-21k | 304.72 | 191.21 | 85.63 | 97.63 | [config](vit-large-p16_64xb64_in1k-384px.py) | [model](https://download.openmmlab.com/mmclassification/v0/vit/finetune/vit-large-p16_in21k-pre-3rdparty_ft-64xb64_in1k-384_20210928-b20ba619.pth) |
+
+*Models with * are converted from the [official repo](https://github.com/google-research/vision_transformer/blob/88a52f8892c80c10de99194990a517b4d80485fd/vit_jax/models.py#L208). The config files of these models are only for inference. We haven't reproduced the training results.*
+
+## Citation
+
+```bibtex
+@inproceedings{
+ dosovitskiy2021an,
+ title={An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale},
+ author={Alexey Dosovitskiy and Lucas Beyer and Alexander Kolesnikov and Dirk Weissenborn and Xiaohua Zhai and Thomas Unterthiner and Mostafa Dehghani and Matthias Minderer and Georg Heigold and Sylvain Gelly and Jakob Uszkoreit and Neil Houlsby},
+ booktitle={International Conference on Learning Representations},
+ year={2021},
+ url={https://openreview.net/forum?id=YicbFdNTTy}
+}
+```
diff --git a/configs/vision_transformer/metafile.yml b/configs/vision_transformer/metafile.yml
new file mode 100644
index 0000000000000000000000000000000000000000..891c413ab6c5b579eb5d404b7b7e7d01fe94b8d8
--- /dev/null
+++ b/configs/vision_transformer/metafile.yml
@@ -0,0 +1,95 @@
+Collections:
+ - Name: Vision Transformer
+ Metadata:
+ Architecture:
+ - Attention Dropout
+ - Convolution
+ - Dense Connections
+ - Dropout
+ - GELU
+ - Layer Normalization
+ - Multi-Head Attention
+ - Scaled Dot-Product Attention
+ - Tanh Activation
+ Paper:
+ Title: 'An Image is Worth 16x16 Words: Transformers for Image Recognition at
+ Scale'
+ URL: https://arxiv.org/abs/2010.11929
+ README: configs/vision_transformer/README.md
+ Code:
+ URL: https://github.com/open-mmlab/mmpretrain/blob/v0.17.0/mmcls/models/backbones/vision_transformer.py
+ Version: v0.17.0
+
+Models:
+ - Name: vit-base-p32_in21k-pre_3rdparty_in1k-384px
+ Metadata:
+ FLOPs: 13056716544
+ Parameters: 88297192
+ Training Data:
+ - ImageNet-21k
+ - ImageNet-1k
+ In Collection: Vision Transformer
+ Results:
+ - Dataset: ImageNet-1k
+ Task: Image Classification
+ Metrics:
+ Top 1 Accuracy: 84.01
+ Top 5 Accuracy: 97.08
+ Weights: https://download.openmmlab.com/mmclassification/v0/vit/finetune/vit-base-p32_in21k-pre-3rdparty_ft-64xb64_in1k-384_20210928-9cea8599.pth
+ Config: configs/vision_transformer/vit-base-p32_64xb64_in1k-384px.py
+ Converted From:
+ Weights: https://console.cloud.google.com/storage/browser/_details/vit_models/augreg/B_32-i21k-300ep-lr_0.001-aug_light1-wd_0.1-do_0.0-sd_0.0--imagenet2012-steps_20k-lr_0.01-res_384.npz
+ Code: https://github.com/google-research/vision_transformer/blob/88a52f8892c80c10de99194990a517b4d80485fd/vit_jax/models.py#L208
+ - Name: vit-base-p16_32xb128-mae_in1k
+ Metadata:
+ FLOPs: 17581972224
+ Parameters: 86567656
+ Training Data:
+ - ImageNet-1k
+ In Collection: Vision Transformer
+ Results:
+ - Dataset: ImageNet-1k
+ Task: Image Classification
+ Metrics:
+ Top 1 Accuracy: 82.37
+ Top 5 Accuracy: 96.15
+ Weights: https://download.openmmlab.com/mmclassification/v0/vit/vit-base-p16_pt-32xb128-mae_in1k_20220623-4c544545.pth
+ Config: configs/vision_transformer/vit-base-p16_32xb128-mae_in1k.py
+ - Name: vit-base-p16_in21k-pre_3rdparty_in1k-384px
+ Metadata:
+ FLOPs: 55538974464
+ Parameters: 86859496
+ Training Data:
+ - ImageNet-21k
+ - ImageNet-1k
+ In Collection: Vision Transformer
+ Results:
+ - Dataset: ImageNet-1k
+ Task: Image Classification
+ Metrics:
+ Top 1 Accuracy: 85.43
+ Top 5 Accuracy: 97.77
+ Weights: https://download.openmmlab.com/mmclassification/v0/vit/finetune/vit-base-p16_in21k-pre-3rdparty_ft-64xb64_in1k-384_20210928-98e8652b.pth
+ Config: configs/vision_transformer/vit-base-p16_64xb64_in1k-384px.py
+ Converted From:
+ Weights: https://console.cloud.google.com/storage/browser/_details/vit_models/augreg/B_16-i21k-300ep-lr_0.001-aug_medium1-wd_0.1-do_0.0-sd_0.0--imagenet2012-steps_20k-lr_0.03-res_384.npz
+ Code: https://github.com/google-research/vision_transformer/blob/88a52f8892c80c10de99194990a517b4d80485fd/vit_jax/models.py#L208
+ - Name: vit-large-p16_in21k-pre_3rdparty_in1k-384px
+ Metadata:
+ FLOPs: 191210034176
+ Parameters: 304715752
+ Training Data:
+ - ImageNet-21k
+ - ImageNet-1k
+ In Collection: Vision Transformer
+ Results:
+ - Dataset: ImageNet-1k
+ Task: Image Classification
+ Metrics:
+ Top 1 Accuracy: 85.63
+ Top 5 Accuracy: 97.63
+ Weights: https://download.openmmlab.com/mmclassification/v0/vit/finetune/vit-large-p16_in21k-pre-3rdparty_ft-64xb64_in1k-384_20210928-b20ba619.pth
+ Config: configs/vision_transformer/vit-large-p16_64xb64_in1k-384px.py
+ Converted From:
+ Weights: https://console.cloud.google.com/storage/browser/_details/vit_models/augreg/L_16-i21k-300ep-lr_0.001-aug_strong1-wd_0.1-do_0.0-sd_0.0--imagenet2012-steps_20k-lr_0.01-res_384.npz
+ Code: https://github.com/google-research/vision_transformer/blob/88a52f8892c80c10de99194990a517b4d80485fd/vit_jax/models.py#L208
diff --git a/configs/vision_transformer/vit-base-p16_32xb128-mae_in1k.py b/configs/vision_transformer/vit-base-p16_32xb128-mae_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..a46bbb21a99b34f792f277759b4dccb75c88b2ed
--- /dev/null
+++ b/configs/vision_transformer/vit-base-p16_32xb128-mae_in1k.py
@@ -0,0 +1,58 @@
+_base_ = [
+ '../_base_/datasets/imagenet_bs64_swin_224.py',
+ '../_base_/schedules/imagenet_bs1024_adamw_swin.py',
+ '../_base_/default_runtime.py'
+]
+
+# model settings
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='VisionTransformer',
+ arch='base',
+ img_size=224,
+ patch_size=16,
+ drop_path_rate=0.1),
+ neck=None,
+ head=dict(
+ type='VisionTransformerClsHead',
+ num_classes=1000,
+ in_channels=768,
+ loss=dict(
+ type='LabelSmoothLoss', label_smooth_val=0.1, mode='original'),
+ ),
+ init_cfg=[
+ dict(type='TruncNormal', layer='Linear', std=.02),
+ dict(type='Constant', layer='LayerNorm', val=1., bias=0.),
+ ],
+ train_cfg=dict(augments=[
+ dict(type='Mixup', alpha=0.8),
+ dict(type='CutMix', alpha=1.0)
+ ]))
+
+# dataset settings
+train_dataloader = dict(batch_size=128)
+
+# schedule settings
+optim_wrapper = dict(
+ optimizer=dict(
+ type='AdamW',
+ lr=1e-4 * 4096 / 256,
+ weight_decay=0.3,
+ eps=1e-8,
+ betas=(0.9, 0.95)),
+ paramwise_cfg=dict(
+ norm_decay_mult=0.0,
+ bias_decay_mult=0.0,
+ custom_keys={
+ '.cls_token': dict(decay_mult=0.0),
+ '.pos_embed': dict(decay_mult=0.0)
+ }))
+
+# runtime settings
+custom_hooks = [dict(type='EMAHook', momentum=1e-4)]
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR
+# based on the actual training batch size.
+# base_batch_size = (32 GPUs) x (128 samples per GPU)
+auto_scale_lr = dict(base_batch_size=4096)
diff --git a/configs/vision_transformer/vit-base-p16_4xb544-ipu_in1k.py b/configs/vision_transformer/vit-base-p16_4xb544-ipu_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..d378b3b265b30b7f3e492dcf22527fed5cd9beb4
--- /dev/null
+++ b/configs/vision_transformer/vit-base-p16_4xb544-ipu_in1k.py
@@ -0,0 +1,114 @@
+_base_ = [
+ '../_base_/models/vit-base-p16.py',
+ '../_base_/datasets/imagenet_bs64_pil_resize_autoaug.py',
+ '../_base_/default_runtime.py'
+]
+
+# specific to vit pretrain
+paramwise_cfg = dict(custom_keys={
+ '.cls_token': dict(decay_mult=0.0),
+ '.pos_embed': dict(decay_mult=0.0)
+})
+
+pretrained = 'https://download.openmmlab.com/mmclassification/v0/vit/pretrain/vit-base-p16_3rdparty_pt-64xb64_in1k-224_20210928-02284250.pth' # noqa
+
+model = dict(
+ head=dict(
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0, _delete_=True), ),
+ backbone=dict(
+ img_size=224,
+ init_cfg=dict(
+ type='Pretrained',
+ checkpoint=pretrained,
+ _delete_=True,
+ prefix='backbone')))
+
+img_norm_cfg = dict(
+ mean=[127.5, 127.5, 127.5], std=[127.5, 127.5, 127.5], to_rgb=True)
+
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='RandomResizedCrop', scale=224, backend='pillow'),
+ dict(type='RandomFlip', prob=0.5, direction='horizontal'),
+ dict(type='Normalize', **img_norm_cfg),
+ dict(type='ImageToTensor', keys=['img']),
+ dict(type='ToTensor', keys=['gt_label']),
+ dict(type='ToHalf', keys=['img']),
+ dict(type='Collect', keys=['img', 'gt_label'])
+]
+
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='Resize', scale=(224, -1), keep_ratio=True, backend='pillow'),
+ dict(type='CenterCrop', crop_size=224),
+ dict(type='Normalize', **img_norm_cfg),
+ dict(type='ImageToTensor', keys=['img']),
+ dict(type='ToHalf', keys=['img']),
+ dict(type='Collect', keys=['img'])
+]
+
+# change batch size
+data = dict(
+ samples_per_gpu=17,
+ workers_per_gpu=16,
+ drop_last=True,
+ train=dict(pipeline=train_pipeline),
+ train_dataloader=dict(mode='async'),
+ val=dict(pipeline=test_pipeline, ),
+ val_dataloader=dict(samples_per_gpu=4, workers_per_gpu=1),
+ test=dict(pipeline=test_pipeline),
+ test_dataloader=dict(samples_per_gpu=4, workers_per_gpu=1))
+
+# optimizer
+optimizer = dict(
+ type='SGD',
+ lr=0.08,
+ weight_decay=1e-5,
+ momentum=0.9,
+ paramwise_cfg=paramwise_cfg,
+)
+
+# learning policy
+param_scheduler = [
+ dict(type='LinearLR', start_factor=0.02, by_epoch=False, begin=0, end=800),
+ dict(
+ type='CosineAnnealingLR',
+ T_max=4200,
+ by_epoch=False,
+ begin=800,
+ end=5000)
+]
+
+# ipu cfg
+# model partition config
+ipu_model_cfg = dict(
+ train_split_edges=[
+ dict(layer_to_call='backbone.patch_embed', ipu_id=0),
+ dict(layer_to_call='backbone.layers.3', ipu_id=1),
+ dict(layer_to_call='backbone.layers.6', ipu_id=2),
+ dict(layer_to_call='backbone.layers.9', ipu_id=3)
+ ],
+ train_ckpt_nodes=['backbone.layers.{}'.format(i) for i in range(12)])
+
+# device config
+options_cfg = dict(
+ randomSeed=42,
+ partialsType='half',
+ train_cfg=dict(
+ executionStrategy='SameAsIpu',
+ Training=dict(gradientAccumulation=32),
+ availableMemoryProportion=[0.3, 0.3, 0.3, 0.3],
+ ),
+ eval_cfg=dict(deviceIterations=1, ),
+)
+
+# add model partition config and device config to runner
+runner = dict(
+ type='IterBasedRunner',
+ ipu_model_cfg=ipu_model_cfg,
+ options_cfg=options_cfg,
+ max_iters=5000)
+
+default_hooks = dict(checkpoint=dict(type='CheckpointHook', interval=1000))
+
+fp16 = dict(loss_scale=256.0, velocity_accum_type='half', accum_type='half')
diff --git a/configs/vision_transformer/vit-base-p16_64xb64_in1k-384px.py b/configs/vision_transformer/vit-base-p16_64xb64_in1k-384px.py
new file mode 100644
index 0000000000000000000000000000000000000000..e0f745874bcef7e3896cfc694c16bf4e5a235fae
--- /dev/null
+++ b/configs/vision_transformer/vit-base-p16_64xb64_in1k-384px.py
@@ -0,0 +1,38 @@
+_base_ = [
+ '../_base_/models/vit-base-p16.py',
+ '../_base_/datasets/imagenet_bs64_pil_resize.py',
+ '../_base_/schedules/imagenet_bs4096_AdamW.py',
+ '../_base_/default_runtime.py'
+]
+
+# model setting
+model = dict(backbone=dict(img_size=384))
+
+# dataset setting
+data_preprocessor = dict(
+ mean=[127.5, 127.5, 127.5],
+ std=[127.5, 127.5, 127.5],
+ # convert image from BGR to RGB
+ to_rgb=True,
+)
+
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='RandomResizedCrop', scale=384, backend='pillow'),
+ dict(type='RandomFlip', prob=0.5, direction='horizontal'),
+ dict(type='PackInputs'),
+]
+
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='ResizeEdge', scale=384, edge='short', backend='pillow'),
+ dict(type='CenterCrop', crop_size=384),
+ dict(type='PackInputs'),
+]
+
+train_dataloader = dict(dataset=dict(pipeline=train_pipeline))
+val_dataloader = dict(dataset=dict(pipeline=test_pipeline))
+test_dataloader = dict(dataset=dict(pipeline=test_pipeline))
+
+# schedule setting
+optim_wrapper = dict(clip_grad=dict(max_norm=1.0))
diff --git a/configs/vision_transformer/vit-base-p16_64xb64_in1k.py b/configs/vision_transformer/vit-base-p16_64xb64_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..07be0e9a373a324f07989476314d391f2fee4f8e
--- /dev/null
+++ b/configs/vision_transformer/vit-base-p16_64xb64_in1k.py
@@ -0,0 +1,15 @@
+_base_ = [
+ '../_base_/models/vit-base-p16.py',
+ '../_base_/datasets/imagenet_bs64_pil_resize_autoaug.py',
+ '../_base_/schedules/imagenet_bs4096_AdamW.py',
+ '../_base_/default_runtime.py'
+]
+
+# model setting
+model = dict(
+ head=dict(hidden_dim=3072),
+ train_cfg=dict(augments=dict(type='Mixup', alpha=0.2)),
+)
+
+# schedule setting
+optim_wrapper = dict(clip_grad=dict(max_norm=1.0))
diff --git a/configs/vision_transformer/vit-base-p16_8xb64-lora_in1k-384px.py b/configs/vision_transformer/vit-base-p16_8xb64-lora_in1k-384px.py
new file mode 100644
index 0000000000000000000000000000000000000000..ffe1018e5d9c0f724911b782a555cb34d50d6ceb
--- /dev/null
+++ b/configs/vision_transformer/vit-base-p16_8xb64-lora_in1k-384px.py
@@ -0,0 +1,84 @@
+_base_ = [
+ '../_base_/datasets/imagenet_bs64_pil_resize.py',
+ '../_base_/schedules/imagenet_bs4096_AdamW.py',
+ '../_base_/default_runtime.py'
+]
+
+# model setting
+model = dict(
+ type='ImageClassifier',
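+ # LoRA fine-tuning: the backbone is wrapped with `LoRAModel`, so the
+ # pretrained ViT weights are kept frozen and only the rank-16 low-rank
+ # adapters injected into the qkv projections (plus the newly initialized
+ # classification head) are updated during training.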
+ backbone=dict(
+ type='LoRAModel',
+ module=dict(
+ type='VisionTransformer',
+ arch='b',
+ img_size=384,
+ patch_size=16,
+ drop_rate=0.1,
+ init_cfg=dict(type='Pretrained', checkpoint='',
+ prefix='backbone')),
+ alpha=16,
+ rank=16,
+ drop_rate=0.1,
+ targets=[dict(type='qkv')]),
+ neck=None,
+ head=dict(
+ type='VisionTransformerClsHead',
+ num_classes=1000,
+ in_channels=768,
+ loss=dict(
+ type='LabelSmoothLoss', label_smooth_val=0.1,
+ mode='classy_vision'),
+ init_cfg=[dict(type='TruncNormal', layer='Linear', std=2e-5)],
+ ))
+
+# dataset setting
+data_preprocessor = dict(
+ mean=[127.5, 127.5, 127.5],
+ std=[127.5, 127.5, 127.5],
+ # convert image from BGR to RGB
+ to_rgb=True,
+)
+
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='RandomResizedCrop', scale=384, backend='pillow'),
+ dict(type='RandomFlip', prob=0.5, direction='horizontal'),
+ dict(type='PackInputs'),
+]
+
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='ResizeEdge', scale=384, edge='short', backend='pillow'),
+ dict(type='CenterCrop', crop_size=384),
+ dict(type='PackInputs'),
+]
+
+train_dataloader = dict(dataset=dict(pipeline=train_pipeline))
+val_dataloader = dict(dataset=dict(pipeline=test_pipeline))
+
+param_scheduler = [
+ dict(
+ type='LinearLR',
+ start_factor=1e-4,
+ by_epoch=True,
+ begin=0,
+ end=5,
+ convert_to_iter_based=True),
+ dict(
+ type='CosineAnnealingLR',
+ T_max=45,
+ by_epoch=True,
+ begin=5,
+ end=50,
+ eta_min=1e-6,
+ convert_to_iter_based=True)
+]
+
+train_cfg = dict(by_epoch=True, max_epochs=50)
+default_hooks = dict(
+ # save checkpoint per epoch.
+ checkpoint=dict(type='CheckpointHook', interval=1, max_keep_ckpts=3))
+
+# schedule setting
+optim_wrapper = dict(clip_grad=dict(max_norm=1.0))
diff --git a/configs/vision_transformer/vit-base-p32_64xb64_in1k-384px.py b/configs/vision_transformer/vit-base-p32_64xb64_in1k-384px.py
new file mode 100644
index 0000000000000000000000000000000000000000..e5a4d14f4dad0759f70b9b9e29c085ad7eff292c
--- /dev/null
+++ b/configs/vision_transformer/vit-base-p32_64xb64_in1k-384px.py
@@ -0,0 +1,38 @@
+_base_ = [
+ '../_base_/models/vit-base-p32.py',
+ '../_base_/datasets/imagenet_bs64_pil_resize.py',
+ '../_base_/schedules/imagenet_bs4096_AdamW.py',
+ '../_base_/default_runtime.py'
+]
+
+# model setting
+model = dict(backbone=dict(img_size=384))
+
+# dataset setting
+data_preprocessor = dict(
+ mean=[127.5, 127.5, 127.5],
+ std=[127.5, 127.5, 127.5],
+ # convert image from BGR to RGB
+ to_rgb=True,
+)
+
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='RandomResizedCrop', scale=384, backend='pillow'),
+ dict(type='RandomFlip', prob=0.5, direction='horizontal'),
+ dict(type='PackInputs'),
+]
+
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='ResizeEdge', scale=384, edge='short', backend='pillow'),
+ dict(type='CenterCrop', crop_size=384),
+ dict(type='PackInputs'),
+]
+
+train_dataloader = dict(dataset=dict(pipeline=train_pipeline))
+val_dataloader = dict(dataset=dict(pipeline=test_pipeline))
+test_dataloader = dict(dataset=dict(pipeline=test_pipeline))
+
+# schedule setting
+optim_wrapper = dict(clip_grad=dict(max_norm=1.0))
diff --git a/configs/vision_transformer/vit-base-p32_64xb64_in1k.py b/configs/vision_transformer/vit-base-p32_64xb64_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..9cfc7c47df0887e4ace1bbaeb59bb5d42e004a83
--- /dev/null
+++ b/configs/vision_transformer/vit-base-p32_64xb64_in1k.py
@@ -0,0 +1,15 @@
+_base_ = [
+ '../_base_/models/vit-base-p32.py',
+ '../_base_/datasets/imagenet_bs64_pil_resize_autoaug.py',
+ '../_base_/schedules/imagenet_bs4096_AdamW.py',
+ '../_base_/default_runtime.py'
+]
+
+# model setting
+model = dict(
+ head=dict(hidden_dim=3072),
+ train_cfg=dict(augments=dict(type='Mixup', alpha=0.2)),
+)
+
+# schedule setting
+optim_wrapper = dict(clip_grad=dict(max_norm=1.0))
diff --git a/configs/vision_transformer/vit-large-p16_64xb64_in1k-384px.py b/configs/vision_transformer/vit-large-p16_64xb64_in1k-384px.py
new file mode 100644
index 0000000000000000000000000000000000000000..98e96ec68ffdaca2648e1ac2ae5a79db30ec8382
--- /dev/null
+++ b/configs/vision_transformer/vit-large-p16_64xb64_in1k-384px.py
@@ -0,0 +1,38 @@
+_base_ = [
+ '../_base_/models/vit-large-p16.py',
+ '../_base_/datasets/imagenet_bs64_pil_resize.py',
+ '../_base_/schedules/imagenet_bs4096_AdamW.py',
+ '../_base_/default_runtime.py'
+]
+
+# model setting
+model = dict(backbone=dict(img_size=384))
+
+# dataset setting
+data_preprocessor = dict(
+ mean=[127.5, 127.5, 127.5],
+ std=[127.5, 127.5, 127.5],
+ # convert image from BGR to RGB
+ to_rgb=True,
+)
+
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='RandomResizedCrop', scale=384, backend='pillow'),
+ dict(type='RandomFlip', prob=0.5, direction='horizontal'),
+ dict(type='PackInputs'),
+]
+
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='ResizeEdge', scale=384, edge='short', backend='pillow'),
+ dict(type='CenterCrop', crop_size=384),
+ dict(type='PackInputs'),
+]
+
+train_dataloader = dict(dataset=dict(pipeline=train_pipeline))
+val_dataloader = dict(dataset=dict(pipeline=test_pipeline))
+test_dataloader = dict(dataset=dict(pipeline=test_pipeline))
+
+# schedule setting
+optim_wrapper = dict(clip_grad=dict(max_norm=1.0))
diff --git a/configs/vision_transformer/vit-large-p16_64xb64_in1k.py b/configs/vision_transformer/vit-large-p16_64xb64_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..0d9bd283b779af36df99574bbdde7701c6b41393
--- /dev/null
+++ b/configs/vision_transformer/vit-large-p16_64xb64_in1k.py
@@ -0,0 +1,15 @@
+_base_ = [
+ '../_base_/models/vit-large-p16.py',
+ '../_base_/datasets/imagenet_bs64_pil_resize_autoaug.py',
+ '../_base_/schedules/imagenet_bs4096_AdamW.py',
+ '../_base_/default_runtime.py'
+]
+
+# model setting
+model = dict(
+ head=dict(hidden_dim=3072),
+ train_cfg=dict(augments=dict(type='Mixup', alpha=0.2)),
+)
+
+# schedule setting
+optim_wrapper = dict(clip_grad=dict(max_norm=1.0))
diff --git a/configs/vision_transformer/vit-large-p32_64xb64_in1k-384px.py b/configs/vision_transformer/vit-large-p32_64xb64_in1k-384px.py
new file mode 100644
index 0000000000000000000000000000000000000000..22320d119890bb80aca47e45322dabeee4d0feb7
--- /dev/null
+++ b/configs/vision_transformer/vit-large-p32_64xb64_in1k-384px.py
@@ -0,0 +1,38 @@
+_base_ = [
+ '../_base_/models/vit-large-p32.py',
+ '../_base_/datasets/imagenet_bs64_pil_resize_autoaug.py',
+ '../_base_/schedules/imagenet_bs4096_AdamW.py',
+ '../_base_/default_runtime.py'
+]
+
+# model setting
+model = dict(backbone=dict(img_size=384))
+
+# dataset setting
+data_preprocessor = dict(
+ mean=[127.5, 127.5, 127.5],
+ std=[127.5, 127.5, 127.5],
+ # convert image from BGR to RGB
+ to_rgb=True,
+)
+
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='RandomResizedCrop', scale=384, backend='pillow'),
+ dict(type='RandomFlip', prob=0.5, direction='horizontal'),
+ dict(type='PackInputs'),
+]
+
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='ResizeEdge', scale=384, edge='short', backend='pillow'),
+ dict(type='CenterCrop', crop_size=384),
+ dict(type='PackInputs'),
+]
+
+train_dataloader = dict(dataset=dict(pipeline=train_pipeline))
+val_dataloader = dict(dataset=dict(pipeline=test_pipeline))
+test_dataloader = dict(dataset=dict(pipeline=test_pipeline))
+
+# schedule setting
+optim_wrapper = dict(clip_grad=dict(max_norm=1.0))
diff --git a/configs/vision_transformer/vit-large-p32_64xb64_in1k.py b/configs/vision_transformer/vit-large-p32_64xb64_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..61e179165b84d8aa521426aa992cc2460d7ae0a5
--- /dev/null
+++ b/configs/vision_transformer/vit-large-p32_64xb64_in1k.py
@@ -0,0 +1,15 @@
+_base_ = [
+ '../_base_/models/vit-large-p32.py',
+ '../_base_/datasets/imagenet_bs64_pil_resize_autoaug.py',
+ '../_base_/schedules/imagenet_bs4096_AdamW.py',
+ '../_base_/default_runtime.py'
+]
+
+# model setting
+model = dict(
+ head=dict(hidden_dim=3072),
+ train_cfg=dict(augments=dict(type='Mixup', alpha=0.2)),
+)
+
+# schedule setting
+optim_wrapper = dict(clip_grad=dict(max_norm=1.0))
diff --git a/configs/wrn/README.md b/configs/wrn/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..2753307b06699b4235aaf1465f0ce5cf89a30952
--- /dev/null
+++ b/configs/wrn/README.md
@@ -0,0 +1,76 @@
+# Wide-ResNet
+
+> [Wide Residual Networks](https://arxiv.org/abs/1605.07146)
+
+
+
+## Abstract
+
+Deep residual networks were shown to be able to scale up to thousands of layers and still have improving performance. However, each fraction of a percent of improved accuracy costs nearly doubling the number of layers, and so training very deep residual networks has a problem of diminishing feature reuse, which makes these networks very slow to train. To tackle these problems, in this paper we conduct a detailed experimental study on the architecture of ResNet blocks, based on which we propose a novel architecture where we decrease depth and increase width of residual networks. We call the resulting network structures wide residual networks (WRNs) and show that these are far superior over their commonly used thin and very deep counterparts. For example, we demonstrate that even a simple 16-layer-deep wide residual network outperforms in accuracy and efficiency all previous deep residual networks, including thousand-layer-deep networks, achieving new state-of-the-art results on CIFAR, SVHN, COCO, and significant improvements on ImageNet.
+
+
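+
+The core idea is simply to keep the residual topology but multiply the number of channels by a widening factor. The block below is a simplified CIFAR-style sketch for intuition only; the configs in this folder instead use the ImageNet Wide-ResNet-50/101 variants, which widen the inner channels of the standard ResNet bottleneck.
+
+```python
+import torch
+import torch.nn as nn
+
+
+class WideBasicBlock(nn.Module):
+    """A basic residual block whose channels are widened by `widen_factor`."""
+
+    def __init__(self, in_channels: int, channels: int, widen_factor: int = 2):
+        super().__init__()
+        width = channels * widen_factor
+        self.conv1 = nn.Conv2d(in_channels, width, 3, padding=1, bias=False)
+        self.bn1 = nn.BatchNorm2d(width)
+        self.conv2 = nn.Conv2d(width, width, 3, padding=1, bias=False)
+        self.bn2 = nn.BatchNorm2d(width)
+        self.relu = nn.ReLU(inplace=True)
+        self.shortcut = (nn.Identity() if in_channels == width else
+                         nn.Conv2d(in_channels, width, 1, bias=False))
+
+    def forward(self, x):
+        out = self.relu(self.bn1(self.conv1(x)))
+        out = self.bn2(self.conv2(out))
+        return self.relu(out + self.shortcut(x))
+
+
+block = WideBasicBlock(16, 16, widen_factor=4)  # 4x wider than the plain block
+print(block(torch.rand(1, 16, 32, 32)).shape)  # torch.Size([1, 64, 32, 32])
+```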
+
+
+
+## How to use it?
+
+
+
+**Predict image**
+
+```python
+from mmpretrain import inference_model
+
+predict = inference_model('wide-resnet50_3rdparty_8xb32_in1k', 'demo/bird.JPEG')
+print(predict['pred_class'])
+print(predict['pred_score'])
+```
+
+**Use the model**
+
+```python
+import torch
+from mmpretrain import get_model
+
+model = get_model('wide-resnet50_3rdparty_8xb32_in1k', pretrained=True)
+inputs = torch.rand(1, 3, 224, 224)
+out = model(inputs)
+print(type(out))
+# To extract features.
+feats = model.extract_feat(inputs)
+print(type(feats))
+```
+
+**Test Command**
+
+Prepare your dataset according to the [docs](https://mmpretrain.readthedocs.io/en/latest/user_guides/dataset_prepare.html#prepare-dataset).
+
+Test:
+
+```shell
+python tools/test.py configs/wrn/wide-resnet50_8xb32_in1k.py https://download.openmmlab.com/mmclassification/v0/wrn/wide-resnet50_3rdparty_8xb32_in1k_20220304-66678344.pth
+```
+
+
+
+## Models and results
+
+### Image Classification on ImageNet-1k
+
+| Model | Pretrain | Params (M) | Flops (G) | Top-1 (%) | Top-5 (%) | Config | Download |
+| :----------------------------------------- | :----------: | :--------: | :-------: | :-------: | :-------: | :----------------------------------------: | :-----------------------------------------------------------------: |
+| `wide-resnet50_3rdparty_8xb32_in1k`\* | From scratch | 68.88 | 11.44 | 78.48 | 94.08 | [config](wide-resnet50_8xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/wrn/wide-resnet50_3rdparty_8xb32_in1k_20220304-66678344.pth) |
+| `wide-resnet101_3rdparty_8xb32_in1k`\* | From scratch | 126.89 | 22.81 | 78.84 | 94.28 | [config](wide-resnet101_8xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/wrn/wide-resnet101_3rdparty_8xb32_in1k_20220304-8d5f9d61.pth) |
+| `wide-resnet50_3rdparty-timm_8xb32_in1k`\* | From scratch | 68.88 | 11.44 | 81.45 | 95.53 | [config](wide-resnet50_timm_8xb32_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/wrn/wide-resnet50_3rdparty-timm_8xb32_in1k_20220304-83ae4399.pth) |
+
+*Models with * are converted from [timm](https://github.com/rwightman/pytorch-image-models/blob/master/timm/models/resnet.py). The config files of these models are only for inference. We haven't reproduced the training results.*
+
+## Citation
+
+```bibtex
+@INPROCEEDINGS{Zagoruyko2016WRN,
+ author = {Sergey Zagoruyko and Nikos Komodakis},
+ title = {Wide Residual Networks},
+ booktitle = {BMVC},
+ year = {2016}}
+```
diff --git a/configs/wrn/metafile.yml b/configs/wrn/metafile.yml
new file mode 100644
index 0000000000000000000000000000000000000000..75e346720cf626c923514e01a5bd3ed33849da9a
--- /dev/null
+++ b/configs/wrn/metafile.yml
@@ -0,0 +1,77 @@
+Collections:
+ - Name: Wide-ResNet
+ Metadata:
+ Training Data: ImageNet-1k
+ Training Techniques:
+ - SGD with Momentum
+ - Weight Decay
+ Training Resources: 8x V100 GPUs
+ Epochs: 100
+ Batch Size: 256
+ Architecture:
+ - 1x1 Convolution
+ - Batch Normalization
+ - Convolution
+ - Global Average Pooling
+ - Max Pooling
+ - ReLU
+ - Residual Connection
+ - Softmax
+ - Wide Residual Block
+ Paper:
+ URL: https://arxiv.org/abs/1605.07146
+ Title: "Wide Residual Networks"
+ README: configs/wrn/README.md
+ Code:
+ URL: https://github.com/open-mmlab/mmpretrain/blob/v0.20.1/mmcls/models/backbones/resnet.py#L383
+ Version: v0.20.1
+
+Models:
+ - Name: wide-resnet50_3rdparty_8xb32_in1k
+ Metadata:
+ FLOPs: 11440000000 # 11.44G
+ Parameters: 68880000 # 68.88M
+ In Collection: Wide-ResNet
+ Results:
+ - Task: Image Classification
+ Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 78.48
+ Top 5 Accuracy: 94.08
+ Weights: https://download.openmmlab.com/mmclassification/v0/wrn/wide-resnet50_3rdparty_8xb32_in1k_20220304-66678344.pth
+ Config: configs/wrn/wide-resnet50_8xb32_in1k.py
+ Converted From:
+ Weights: https://download.pytorch.org/models/wide_resnet50_2-95faca4d.pth
+ Code: https://github.com/pytorch/vision/blob/main/torchvision/models/resnet.py
+ - Name: wide-resnet101_3rdparty_8xb32_in1k
+ Metadata:
+ FLOPs: 22810000000 # 22.81G
+ Parameters: 126890000 # 126.89M
+ In Collection: Wide-ResNet
+ Results:
+ - Task: Image Classification
+ Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 78.84
+ Top 5 Accuracy: 94.28
+ Weights: https://download.openmmlab.com/mmclassification/v0/wrn/wide-resnet101_3rdparty_8xb32_in1k_20220304-8d5f9d61.pth
+ Config: configs/wrn/wide-resnet101_8xb32_in1k.py
+ Converted From:
+ Weights: https://download.pytorch.org/models/wide_resnet101_2-32ee1156.pth
+ Code: https://github.com/pytorch/vision/blob/main/torchvision/models/resnet.py
+ - Name: wide-resnet50_3rdparty-timm_8xb32_in1k
+ Metadata:
+ FLOPs: 11440000000 # 11.44G
+ Parameters: 68880000 # 68.88M
+ In Collection: Wide-ResNet
+ Results:
+ - Task: Image Classification
+ Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 81.45
+ Top 5 Accuracy: 95.53
+ Weights: https://download.openmmlab.com/mmclassification/v0/wrn/wide-resnet50_3rdparty-timm_8xb32_in1k_20220304-83ae4399.pth
+ Config: configs/wrn/wide-resnet50_timm_8xb32_in1k.py
+ Converted From:
+ Weights: https://github.com/rwightman/pytorch-image-models/releases/download/v0.1-weights/wide_resnet50_racm-8234f177.pth
+ Code: https://github.com/rwightman/pytorch-image-models/blob/master/timm/models/resnet.py
diff --git a/configs/wrn/wide-resnet101_8xb32_in1k.py b/configs/wrn/wide-resnet101_8xb32_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..d1bf5e5e5fac3655bd27f64f4c5c5a1316403a3b
--- /dev/null
+++ b/configs/wrn/wide-resnet101_8xb32_in1k.py
@@ -0,0 +1,7 @@
+_base_ = [
+ '../_base_/models/wide-resnet50.py',
+ '../_base_/datasets/imagenet_bs32_pil_resize.py',
+ '../_base_/schedules/imagenet_bs256.py', '../_base_/default_runtime.py'
+]
+
+model = dict(backbone=dict(depth=101))
diff --git a/configs/wrn/wide-resnet50_8xb32_in1k.py b/configs/wrn/wide-resnet50_8xb32_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..edf6a0518ac73f4eaa54f261ecbfce8acf0f2035
--- /dev/null
+++ b/configs/wrn/wide-resnet50_8xb32_in1k.py
@@ -0,0 +1,5 @@
+_base_ = [
+ '../_base_/models/wide-resnet50.py',
+ '../_base_/datasets/imagenet_bs32_pil_resize.py',
+ '../_base_/schedules/imagenet_bs256.py', '../_base_/default_runtime.py'
+]
diff --git a/configs/wrn/wide-resnet50_timm_8xb32_in1k.py b/configs/wrn/wide-resnet50_timm_8xb32_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..8dca8f37319f8d60df0e42123b2ebe16a3f7d9d8
--- /dev/null
+++ b/configs/wrn/wide-resnet50_timm_8xb32_in1k.py
@@ -0,0 +1,5 @@
+_base_ = [
+ '../_base_/models/wide-resnet50.py',
+ '../_base_/datasets/imagenet_bs32_pil_bicubic.py',
+ '../_base_/schedules/imagenet_bs256.py', '../_base_/default_runtime.py'
+]
diff --git a/configs/xcit/README.md b/configs/xcit/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..ab2cd7a3634e4d877bca3d5125d3506d3861b428
--- /dev/null
+++ b/configs/xcit/README.md
@@ -0,0 +1,106 @@
+# XCiT
+
+> [XCiT: Cross-Covariance Image Transformers](https://arxiv.org/abs/2106.09681)
+
+
+
+## Abstract
+
+Following their success in natural language processing, transformers have recently shown much promise for computer vision. The self-attention operation underlying transformers yields global interactions between all tokens, i.e. words or image patches, and enables flexible modelling of image data beyond the local interactions of convolutions. This flexibility, however, comes with a quadratic complexity in time and memory, hindering application to long sequences and high-resolution images. We propose a "transposed" version of self-attention that operates across feature channels rather than tokens, where the interactions are based on the cross-covariance matrix between keys and queries. The resulting cross-covariance attention (XCA) has linear complexity in the number of tokens, and allows efficient processing of high-resolution images. Our cross-covariance image transformer (XCiT) is built upon XCA. It combines the accuracy of conventional transformers with the scalability of convolutional architectures. We validate the effectiveness and generality of XCiT by reporting excellent results on multiple vision benchmarks, including image classification and self-supervised feature learning on ImageNet-1k, object detection and instance segmentation on COCO, and semantic segmentation on ADE20k.
+
+
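+
+The cross-covariance attention (XCA) described above attends over feature channels instead of tokens: queries and keys are L2-normalized along the token dimension, their C x C cross-covariance matrix (scaled by a temperature) serves as the attention map, so the cost grows linearly with the number of tokens. A minimal single-head sketch is shown below; it is illustrative only and omits the multi-head split, projections and learnable temperature of the real model.
+
+```python
+import torch
+import torch.nn.functional as F
+
+
+def cross_covariance_attention(q, k, v, temperature=1.0):
+    """Single-head XCA sketch: attention over channels instead of tokens.
+
+    q, k, v: (B, N, C) token features; returns (B, N, C).
+    """
+    # normalize along the token dimension so the attention map becomes a
+    # (C x C) cross-covariance matrix between feature channels
+    q = F.normalize(q, dim=1)
+    k = F.normalize(k, dim=1)
+    attn = (q.transpose(-2, -1) @ k) * temperature  # (B, C, C)
+    attn = attn.softmax(dim=-1)
+    return (attn @ v.transpose(-2, -1)).transpose(-2, -1)  # (B, N, C)
+
+
+x = torch.randn(2, 196, 128)  # 196 tokens with 128 channels
+out = cross_covariance_attention(x, x, x)
+print(out.shape)  # torch.Size([2, 196, 128])
+```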
+
+
+
+## How to use it?
+
+
+
+**Use the model**
+
+```python
+import torch
+from mmpretrain import get_model
+
+model = get_model('xcit-nano-12-p16_3rdparty_in1k', pretrained=True)
+inputs = torch.rand(1, 3, 224, 224)
+out = model(inputs)
+print(type(out))
+# To extract features.
+feats = model.extract_feat(inputs)
+print(type(feats))
+```
+
+**Test Command**
+
+Prepare your dataset according to the [docs](https://mmpretrain.readthedocs.io/en/latest/user_guides/dataset_prepare.html#prepare-dataset).
+
+Test:
+
+```shell
+python tools/test.py configs/xcit/xcit-nano-12-p16_8xb128_in1k.py https://download.openmmlab.com/mmclassification/v0/xcit/xcit-nano-12-p16_3rdparty_in1k_20230213-ed776c38.pth
+```
+
+
+
+## Models and results
+
+### Pretrained models
+
+| Model | Params (M) | Flops (G) | Config | Download |
+| :---------------------------------------------- | :--------: | :-------: | :-----------------------------------------------: | :-----------------------------------------------------------------------------------: |
+| `xcit-nano-12-p16_3rdparty_in1k`\* | 3.05 | 0.56 | [config](xcit-nano-12-p16_8xb128_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/xcit/xcit-nano-12-p16_3rdparty_in1k_20230213-ed776c38.pth) |
+| `xcit-nano-12-p16_3rdparty-dist_in1k`\* | 3.05 | 0.56 | [config](xcit-nano-12-p16_8xb128_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/xcit/xcit-nano-12-p16_3rdparty-dist_in1k_20230213-fb247f7b.pth) |
+| `xcit-tiny-12-p16_3rdparty_in1k`\* | 6.72 | 1.24 | [config](xcit-tiny-12-p16_8xb128_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/xcit/xcit-tiny-12-p16_3rdparty_in1k_20230213-82c547ca.pth) |
+| `xcit-tiny-12-p16_3rdparty-dist_in1k`\* | 6.72 | 1.24 | [config](xcit-tiny-12-p16_8xb128_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/xcit/xcit-tiny-12-p16_3rdparty-dist_in1k_20230213-d5fde0a3.pth) |
+| `xcit-nano-12-p16_3rdparty-dist_in1k-384px`\* | 3.05 | 1.64 | [config](xcit-nano-12-p16_8xb128_in1k-384px.py) | [model](https://download.openmmlab.com/mmclassification/v0/xcit/xcit-nano-12-p16_3rdparty-dist_in1k-384px_20230213-712db4d4.pth) |
+| `xcit-nano-12-p8_3rdparty_in1k`\* | 3.05 | 2.16 | [config](xcit-nano-12-p8_8xb128_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/xcit/xcit-nano-12-p8_3rdparty_in1k_20230213-3370c293.pth) |
+| `xcit-nano-12-p8_3rdparty-dist_in1k`\* | 3.05 | 2.16 | [config](xcit-nano-12-p8_8xb128_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/xcit/xcit-nano-12-p8_3rdparty-dist_in1k_20230213-2f87d2b3.pth) |
+| `xcit-tiny-24-p16_3rdparty_in1k`\* | 12.12 | 2.34 | [config](xcit-tiny-24-p16_8xb128_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/xcit/xcit-tiny-24-p16_3rdparty_in1k_20230213-366c1cd0.pth) |
+| `xcit-tiny-24-p16_3rdparty-dist_in1k`\* | 12.12 | 2.34 | [config](xcit-tiny-24-p16_8xb128_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/xcit/xcit-tiny-24-p16_3rdparty-dist_in1k_20230213-b472e80a.pth) |
+| `xcit-tiny-12-p16_3rdparty-dist_in1k-384px`\* | 6.72 | 3.64 | [config](xcit-tiny-12-p16_8xb128_in1k-384px.py) | [model](https://download.openmmlab.com/mmclassification/v0/xcit/xcit-tiny-12-p16_3rdparty-dist_in1k-384px_20230213-00a20023.pth) |
+| `xcit-tiny-12-p8_3rdparty_in1k`\* | 6.71 | 4.81 | [config](xcit-tiny-12-p8_8xb128_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/xcit/xcit-tiny-12-p8_3rdparty_in1k_20230213-8b02f8f5.pth) |
+| `xcit-tiny-12-p8_3rdparty-dist_in1k`\* | 6.71 | 4.81 | [config](xcit-tiny-12-p8_8xb128_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/xcit/xcit-tiny-12-p8_3rdparty-dist_in1k_20230213-f3f9b44f.pth) |
+| `xcit-small-12-p16_3rdparty_in1k`\* | 26.25 | 4.81 | [config](xcit-small-12-p16_8xb128_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/xcit/xcit-small-12-p16_3rdparty_in1k_20230213-d36779d2.pth) |
+| `xcit-small-12-p16_3rdparty-dist_in1k`\* | 26.25 | 4.81 | [config](xcit-small-12-p16_8xb128_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/xcit/xcit-small-12-p16_3rdparty-dist_in1k_20230213-c95bbae1.pth) |
+| `xcit-nano-12-p8_3rdparty-dist_in1k-384px`\* | 3.05 | 6.34 | [config](xcit-nano-12-p8_8xb128_in1k-384px.py) | [model](https://download.openmmlab.com/mmclassification/v0/xcit/xcit-nano-12-p8_3rdparty-dist_in1k-384px_20230213-09d925ef.pth) |
+| `xcit-tiny-24-p16_3rdparty-dist_in1k-384px`\* | 12.12 | 6.87 | [config](xcit-tiny-24-p16_8xb128_in1k-384px.py) | [model](https://download.openmmlab.com/mmclassification/v0/xcit/xcit-tiny-24-p16_3rdparty-dist_in1k-384px_20230213-20e13917.pth) |
+| `xcit-small-24-p16_3rdparty_in1k`\* | 47.67 | 9.10 | [config](xcit-small-24-p16_8xb128_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/xcit/xcit-small-24-p16_3rdparty_in1k_20230213-40febe38.pth) |
+| `xcit-small-24-p16_3rdparty-dist_in1k`\* | 47.67 | 9.10 | [config](xcit-small-24-p16_8xb128_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/xcit/xcit-small-24-p16_3rdparty-dist_in1k_20230213-130d7262.pth) |
+| `xcit-tiny-24-p8_3rdparty_in1k`\* | 12.11 | 9.21 | [config](xcit-tiny-24-p8_8xb128_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/xcit/xcit-tiny-24-p8_3rdparty_in1k_20230213-4b9ba392.pth) |
+| `xcit-tiny-24-p8_3rdparty-dist_in1k`\* | 12.11 | 9.21 | [config](xcit-tiny-24-p8_8xb128_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/xcit/xcit-tiny-24-p8_3rdparty-dist_in1k_20230213-ad9c44b0.pth) |
+| `xcit-tiny-12-p8_3rdparty-dist_in1k-384px`\* | 6.71 | 14.13 | [config](xcit-tiny-12-p8_8xb128_in1k-384px.py) | [model](https://download.openmmlab.com/mmclassification/v0/xcit/xcit-tiny-12-p8_3rdparty-dist_in1k-384px_20230213-a072174a.pth) |
+| `xcit-small-12-p16_3rdparty-dist_in1k-384px`\* | 26.25 | 14.14 | [config](xcit-small-12-p16_8xb128_in1k-384px.py) | [model](https://download.openmmlab.com/mmclassification/v0/xcit/xcit-small-12-p16_3rdparty-dist_in1k-384px_20230213-ba36c982.pth) |
+| `xcit-medium-24-p16_3rdparty_in1k`\* | 84.40 | 16.13 | [config](xcit-medium-24-p16_8xb128_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/xcit/xcit-medium-24-p16_3rdparty_in1k_20230213-ad0aa92e.pth) |
+| `xcit-medium-24-p16_3rdparty-dist_in1k`\* | 84.40 | 16.13 | [config](xcit-medium-24-p16_8xb128_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/xcit/xcit-medium-24-p16_3rdparty-dist_in1k_20230213-aca5cd0c.pth) |
+| `xcit-small-12-p8_3rdparty_in1k`\* | 26.21 | 18.69 | [config](xcit-small-12-p8_8xb128_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/xcit/xcit-small-12-p8_3rdparty_in1k_20230213-9e364ce3.pth) |
+| `xcit-small-12-p8_3rdparty-dist_in1k`\* | 26.21 | 18.69 | [config](xcit-small-12-p8_8xb128_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/xcit/xcit-small-12-p8_3rdparty-dist_in1k_20230213-71886580.pth) |
+| `xcit-small-24-p16_3rdparty-dist_in1k-384px`\* | 47.67 | 26.72 | [config](xcit-small-24-p16_8xb128_in1k-384px.py) | [model](https://download.openmmlab.com/mmclassification/v0/xcit/xcit-small-24-p16_3rdparty-dist_in1k-384px_20230213-28fa2d0e.pth) |
+| `xcit-tiny-24-p8_3rdparty-dist_in1k-384px`\* | 12.11 | 27.05 | [config](xcit-tiny-24-p8_8xb128_in1k-384px.py) | [model](https://download.openmmlab.com/mmclassification/v0/xcit/xcit-tiny-24-p8_3rdparty-dist_in1k-384px_20230213-30d5e5ec.pth) |
+| `xcit-small-24-p8_3rdparty_in1k`\* | 47.63 | 35.81 | [config](xcit-small-24-p8_8xb128_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/xcit/xcit-small-24-p8_3rdparty_in1k_20230213-280ebcc7.pth) |
+| `xcit-small-24-p8_3rdparty-dist_in1k`\* | 47.63 | 35.81 | [config](xcit-small-24-p8_8xb128_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/xcit/xcit-small-24-p8_3rdparty-dist_in1k_20230213-f2773c78.pth) |
+| `xcit-large-24-p16_3rdparty_in1k`\* | 189.10 | 35.86 | [config](xcit-large-24-p16_8xb128_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/xcit/xcit-large-24-p16_3rdparty_in1k_20230214-d29d2529.pth) |
+| `xcit-large-24-p16_3rdparty-dist_in1k`\* | 189.10 | 35.86 | [config](xcit-large-24-p16_8xb128_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/xcit/xcit-large-24-p16_3rdparty-dist_in1k_20230214-4fea599c.pth) |
+| `xcit-medium-24-p16_3rdparty-dist_in1k-384px`\* | 84.40 | 47.39 | [config](xcit-medium-24-p16_8xb128_in1k-384px.py) | [model](https://download.openmmlab.com/mmclassification/v0/xcit/xcit-medium-24-p16_3rdparty-dist_in1k-384px_20230214-6c23a201.pth) |
+| `xcit-small-12-p8_3rdparty-dist_in1k-384px`\* | 26.21 | 54.92 | [config](xcit-small-12-p8_8xb128_in1k-384px.py) | [model](https://download.openmmlab.com/mmclassification/v0/xcit/xcit-small-12-p8_3rdparty-dist_in1k-384px_20230214-9f2178bc.pth) |
+| `xcit-medium-24-p8_3rdparty_in1k`\* | 84.32 | 63.52 | [config](xcit-medium-24-p8_8xb128_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/xcit/xcit-medium-24-p8_3rdparty_in1k_20230214-c362850b.pth) |
+| `xcit-medium-24-p8_3rdparty-dist_in1k`\* | 84.32 | 63.52 | [config](xcit-medium-24-p8_8xb128_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/xcit/xcit-medium-24-p8_3rdparty-dist_in1k_20230214-625c953b.pth) |
+| `xcit-small-24-p8_3rdparty-dist_in1k-384px`\* | 47.63 | 105.24 | [config](xcit-small-24-p8_8xb128_in1k-384px.py) | [model](https://download.openmmlab.com/mmclassification/v0/xcit/xcit-small-24-p8_3rdparty-dist_in1k-384px_20230214-57298eca.pth) |
+| `xcit-large-24-p16_3rdparty-dist_in1k-384px`\* | 189.10 | 105.35 | [config](xcit-large-24-p16_8xb128_in1k-384px.py) | [model](https://download.openmmlab.com/mmclassification/v0/xcit/xcit-large-24-p16_3rdparty-dist_in1k-384px_20230214-bd515a34.pth) |
+| `xcit-large-24-p8_3rdparty_in1k`\* | 188.93 | 141.23 | [config](xcit-large-24-p8_8xb128_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/xcit/xcit-large-24-p8_3rdparty_in1k_20230214-08f2f664.pth) |
+| `xcit-large-24-p8_3rdparty-dist_in1k`\* | 188.93 | 141.23 | [config](xcit-large-24-p8_8xb128_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/xcit/xcit-large-24-p8_3rdparty-dist_in1k_20230214-8c092b34.pth) |
+| `xcit-medium-24-p8_3rdparty-dist_in1k-384px`\* | 84.32 | 186.67 | [config](xcit-medium-24-p8_8xb128_in1k-384px.py) | [model](https://download.openmmlab.com/mmclassification/v0/xcit/xcit-medium-24-p8_3rdparty-dist_in1k-384px_20230214-5db925e0.pth) |
+| `xcit-large-24-p8_3rdparty-dist_in1k-384px`\* | 188.93 | 415.00 | [config](xcit-large-24-p8_8xb128_in1k-384px.py) | [model](https://download.openmmlab.com/mmclassification/v0/xcit/xcit-large-24-p8_3rdparty-dist_in1k-384px_20230214-9f718b1a.pth) |
+
+*Models with * are converted from the [official repo](https://github.com/facebookresearch/xcit). The config files of these models are only for inference. We haven't reproduced the training results.*
+
+## Citation
+
+```bibtex
+@article{el2021xcit,
+ title={XCiT: Cross-Covariance Image Transformers},
+ author={El-Nouby, Alaaeldin and Touvron, Hugo and Caron, Mathilde and Bojanowski, Piotr and Douze, Matthijs and Joulin, Armand and Laptev, Ivan and Neverova, Natalia and Synnaeve, Gabriel and Verbeek, Jakob and others},
+ journal={arXiv preprint arXiv:2106.09681},
+ year={2021}
+}
+```
diff --git a/configs/xcit/metafile.yml b/configs/xcit/metafile.yml
new file mode 100644
index 0000000000000000000000000000000000000000..8379da1927cae6a45433351ca0b930b54f0e9ba7
--- /dev/null
+++ b/configs/xcit/metafile.yml
@@ -0,0 +1,727 @@
+Collections:
+ - Name: XCiT
+ Metadata:
+ Architecture:
+ - Class Attention
+ - Local Patch Interaction
+ - Cross-Covariance Attention
+ Paper:
+ Title: 'XCiT: Cross-Covariance Image Transformers'
+ URL: https://arxiv.org/abs/2106.09681
+ README: configs/xcit/README.md
+
+Models:
+ - Name: xcit-nano-12-p16_3rdparty_in1k
+ Metadata:
+ FLOPs: 557074560
+ Parameters: 3053224
+ Training Data: ImageNet-1k
+ In Collection: XCiT
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 70.35
+ Top 5 Accuracy: 89.98
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/xcit/xcit-nano-12-p16_3rdparty_in1k_20230213-ed776c38.pth
+ Config: configs/xcit/xcit-nano-12-p16_8xb128_in1k.py
+ Converted From:
+ Code: https://github.com/facebookresearch/xcit
+ Weights: https://dl.fbaipublicfiles.com/xcit/xcit_nano_12_p16_224.pth
+ - Name: xcit-nano-12-p16_3rdparty-dist_in1k
+ Metadata:
+ FLOPs: 557074560
+ Parameters: 3053224
+ Training Data: ImageNet-1k
+ In Collection: XCiT
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 72.36
+ Top 5 Accuracy: 91.02
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/xcit/xcit-nano-12-p16_3rdparty-dist_in1k_20230213-fb247f7b.pth
+ Config: configs/xcit/xcit-nano-12-p16_8xb128_in1k.py
+ Converted From:
+ Code: https://github.com/facebookresearch/xcit
+ Weights: https://dl.fbaipublicfiles.com/xcit/xcit_nano_12_p16_224_dist.pth
+ - Name: xcit-tiny-12-p16_3rdparty_in1k
+ Metadata:
+ FLOPs: 1239698112
+ Parameters: 6716272
+ Training Data: ImageNet-1k
+ In Collection: XCiT
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 77.21
+ Top 5 Accuracy: 93.62
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/xcit/xcit-tiny-12-p16_3rdparty_in1k_20230213-82c547ca.pth
+ Config: configs/xcit/xcit-tiny-12-p16_8xb128_in1k.py
+ Converted From:
+ Code: https://github.com/facebookresearch/xcit
+ Weights: https://dl.fbaipublicfiles.com/xcit/xcit_tiny_12_p16_224.pth
+ - Name: xcit-tiny-12-p16_3rdparty-dist_in1k
+ Metadata:
+ FLOPs: 1239698112
+ Parameters: 6716272
+ Training Data: ImageNet-1k
+ In Collection: XCiT
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 78.7
+ Top 5 Accuracy: 94.12
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/xcit/xcit-tiny-12-p16_3rdparty-dist_in1k_20230213-d5fde0a3.pth
+ Config: configs/xcit/xcit-tiny-12-p16_8xb128_in1k.py
+ Converted From:
+ Code: https://github.com/facebookresearch/xcit
+ Weights: https://dl.fbaipublicfiles.com/xcit/xcit_tiny_12_p16_224_dist.pth
+ - Name: xcit-nano-12-p16_3rdparty-dist_in1k-384px
+ Metadata:
+ FLOPs: 1636347520
+ Parameters: 3053224
+ Training Data: ImageNet-1k
+ In Collection: XCiT
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 74.93
+ Top 5 Accuracy: 92.42
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/xcit/xcit-nano-12-p16_3rdparty-dist_in1k-384px_20230213-712db4d4.pth
+ Config: configs/xcit/xcit-nano-12-p16_8xb128_in1k-384px.py
+ Converted From:
+ Code: https://github.com/facebookresearch/xcit
+ Weights: https://dl.fbaipublicfiles.com/xcit/xcit_nano_12_p16_384_dist.pth
+ - Name: xcit-nano-12-p8_3rdparty_in1k
+ Metadata:
+ FLOPs: 2156861056
+ Parameters: 3049016
+ Training Data: ImageNet-1k
+ In Collection: XCiT
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 73.8
+ Top 5 Accuracy: 92.08
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/xcit/xcit-nano-12-p8_3rdparty_in1k_20230213-3370c293.pth
+ Config: configs/xcit/xcit-nano-12-p8_8xb128_in1k.py
+ Converted From:
+ Code: https://github.com/facebookresearch/xcit
+ Weights: https://dl.fbaipublicfiles.com/xcit/xcit_nano_12_p8_224.pth
+ - Name: xcit-nano-12-p8_3rdparty-dist_in1k
+ Metadata:
+ FLOPs: 2156861056
+ Parameters: 3049016
+ Training Data: ImageNet-1k
+ In Collection: XCiT
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 76.17
+ Top 5 Accuracy: 93.08
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/xcit/xcit-nano-12-p8_3rdparty-dist_in1k_20230213-2f87d2b3.pth
+ Config: configs/xcit/xcit-nano-12-p8_8xb128_in1k.py
+ Converted From:
+ Code: https://github.com/facebookresearch/xcit
+ Weights: https://dl.fbaipublicfiles.com/xcit/xcit_nano_12_p8_224_dist.pth
+ - Name: xcit-tiny-24-p16_3rdparty_in1k
+ Metadata:
+ FLOPs: 2339305152
+ Parameters: 12116896
+ Training Data: ImageNet-1k
+ In Collection: XCiT
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 79.47
+ Top 5 Accuracy: 94.85
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/xcit/xcit-tiny-24-p16_3rdparty_in1k_20230213-366c1cd0.pth
+ Config: configs/xcit/xcit-tiny-24-p16_8xb128_in1k.py
+ Converted From:
+ Code: https://github.com/facebookresearch/xcit
+ Weights: https://dl.fbaipublicfiles.com/xcit/xcit_tiny_24_p16_224.pth
+ - Name: xcit-tiny-24-p16_3rdparty-dist_in1k
+ Metadata:
+ FLOPs: 2339305152
+ Parameters: 12116896
+ Training Data: ImageNet-1k
+ In Collection: XCiT
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 80.51
+ Top 5 Accuracy: 95.17
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/xcit/xcit-tiny-24-p16_3rdparty-dist_in1k_20230213-b472e80a.pth
+ Config: configs/xcit/xcit-tiny-24-p16_8xb128_in1k.py
+ Converted From:
+ Code: https://github.com/facebookresearch/xcit
+ Weights: https://dl.fbaipublicfiles.com/xcit/xcit_tiny_24_p16_224_dist.pth
+ - Name: xcit-tiny-12-p16_3rdparty-dist_in1k-384px
+ Metadata:
+ FLOPs: 3641468352
+ Parameters: 6716272
+ Training Data: ImageNet-1k
+ In Collection: XCiT
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 80.58
+ Top 5 Accuracy: 95.38
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/xcit/xcit-tiny-12-p16_3rdparty-dist_in1k-384px_20230213-00a20023.pth
+ Config: configs/xcit/xcit-tiny-12-p16_8xb128_in1k-384px.py
+ Converted From:
+ Code: https://github.com/facebookresearch/xcit
+ Weights: https://dl.fbaipublicfiles.com/xcit/xcit_tiny_12_p16_384_dist.pth
+ - Name: xcit-tiny-12-p8_3rdparty_in1k
+ Metadata:
+ FLOPs: 4807399872
+ Parameters: 6706504
+ Training Data: ImageNet-1k
+ In Collection: XCiT
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 79.75
+ Top 5 Accuracy: 94.88
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/xcit/xcit-tiny-12-p8_3rdparty_in1k_20230213-8b02f8f5.pth
+ Config: configs/xcit/xcit-tiny-12-p8_8xb128_in1k.py
+ Converted From:
+ Code: https://github.com/facebookresearch/xcit
+ Weights: https://dl.fbaipublicfiles.com/xcit/xcit_tiny_12_p8_224.pth
+ - Name: xcit-tiny-12-p8_3rdparty-dist_in1k
+ Metadata:
+ FLOPs: 4807399872
+ Parameters: 6706504
+ Training Data: ImageNet-1k
+ In Collection: XCiT
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 81.26
+ Top 5 Accuracy: 95.46
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/xcit/xcit-tiny-12-p8_3rdparty-dist_in1k_20230213-f3f9b44f.pth
+ Config: configs/xcit/xcit-tiny-12-p8_8xb128_in1k.py
+ Converted From:
+ Code: https://github.com/facebookresearch/xcit
+ Weights: https://dl.fbaipublicfiles.com/xcit/xcit_tiny_12_p8_224_dist.pth
+ - Name: xcit-small-12-p16_3rdparty_in1k
+ Metadata:
+ FLOPs: 4814951808
+ Parameters: 26253304
+ Training Data: ImageNet-1k
+ In Collection: XCiT
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 81.87
+ Top 5 Accuracy: 95.77
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/xcit/xcit-small-12-p16_3rdparty_in1k_20230213-d36779d2.pth
+ Config: configs/xcit/xcit-small-12-p16_8xb128_in1k.py
+ Converted From:
+ Code: https://github.com/facebookresearch/xcit
+ Weights: https://dl.fbaipublicfiles.com/xcit/xcit_small_12_p16_224.pth
+ - Name: xcit-small-12-p16_3rdparty-dist_in1k
+ Metadata:
+ FLOPs: 4814951808
+ Parameters: 26253304
+ Training Data: ImageNet-1k
+ In Collection: XCiT
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 83.12
+ Top 5 Accuracy: 96.41
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/xcit/xcit-small-12-p16_3rdparty-dist_in1k_20230213-c95bbae1.pth
+ Config: configs/xcit/xcit-small-12-p16_8xb128_in1k.py
+ Converted From:
+ Code: https://github.com/facebookresearch/xcit
+ Weights: https://dl.fbaipublicfiles.com/xcit/xcit_small_12_p16_224_dist.pth
+ - Name: xcit-nano-12-p8_3rdparty-dist_in1k-384px
+ Metadata:
+ FLOPs: 6337760896
+ Parameters: 3049016
+ Training Data: ImageNet-1k
+ In Collection: XCiT
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 77.69
+ Top 5 Accuracy: 94.09
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/xcit/xcit-nano-12-p8_3rdparty-dist_in1k-384px_20230213-09d925ef.pth
+ Config: configs/xcit/xcit-nano-12-p8_8xb128_in1k-384px.py
+ Converted From:
+ Code: https://github.com/facebookresearch/xcit
+ Weights: https://dl.fbaipublicfiles.com/xcit/xcit_nano_12_p8_384_dist.pth
+ - Name: xcit-tiny-24-p16_3rdparty-dist_in1k-384px
+ Metadata:
+ FLOPs: 6872966592
+ Parameters: 12116896
+ Training Data: ImageNet-1k
+ In Collection: XCiT
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 82.43
+ Top 5 Accuracy: 96.2
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/xcit/xcit-tiny-24-p16_3rdparty-dist_in1k-384px_20230213-20e13917.pth
+ Config: configs/xcit/xcit-tiny-24-p16_8xb128_in1k-384px.py
+ Converted From:
+ Code: https://github.com/facebookresearch/xcit
+ Weights: https://dl.fbaipublicfiles.com/xcit/xcit_tiny_24_p16_384_dist.pth
+ - Name: xcit-small-24-p16_3rdparty_in1k
+ Metadata:
+ FLOPs: 9095064960
+ Parameters: 47671384
+ Training Data: ImageNet-1k
+ In Collection: XCiT
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 82.38
+ Top 5 Accuracy: 95.93
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/xcit/xcit-small-24-p16_3rdparty_in1k_20230213-40febe38.pth
+ Config: configs/xcit/xcit-small-24-p16_8xb128_in1k.py
+ Converted From:
+ Code: https://github.com/facebookresearch/xcit
+ Weights: https://dl.fbaipublicfiles.com/xcit/xcit_small_24_p16_224.pth
+ - Name: xcit-small-24-p16_3rdparty-dist_in1k
+ Metadata:
+ FLOPs: 9095064960
+ Parameters: 47671384
+ Training Data: ImageNet-1k
+ In Collection: XCiT
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 83.7
+ Top 5 Accuracy: 96.61
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/xcit/xcit-small-24-p16_3rdparty-dist_in1k_20230213-130d7262.pth
+ Config: configs/xcit/xcit-small-24-p16_8xb128_in1k.py
+ Converted From:
+ Code: https://github.com/facebookresearch/xcit
+ Weights: https://dl.fbaipublicfiles.com/xcit/xcit_small_24_p16_224_dist.pth
+ - Name: xcit-tiny-24-p8_3rdparty_in1k
+ Metadata:
+ FLOPs: 9205828032
+ Parameters: 12107128
+ Training Data: ImageNet-1k
+ In Collection: XCiT
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 81.7
+ Top 5 Accuracy: 95.9
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/xcit/xcit-tiny-24-p8_3rdparty_in1k_20230213-4b9ba392.pth
+ Config: configs/xcit/xcit-tiny-24-p8_8xb128_in1k.py
+ Converted From:
+ Code: https://github.com/facebookresearch/xcit
+ Weights: https://dl.fbaipublicfiles.com/xcit/xcit_tiny_24_p8_224.pth
+ - Name: xcit-tiny-24-p8_3rdparty-dist_in1k
+ Metadata:
+ FLOPs: 9205828032
+ Parameters: 12107128
+ Training Data: ImageNet-1k
+ In Collection: XCiT
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 82.62
+ Top 5 Accuracy: 96.16
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/xcit/xcit-tiny-24-p8_3rdparty-dist_in1k_20230213-ad9c44b0.pth
+ Config: configs/xcit/xcit-tiny-24-p8_8xb128_in1k.py
+ Converted From:
+ Code: https://github.com/facebookresearch/xcit
+ Weights: https://dl.fbaipublicfiles.com/xcit/xcit_tiny_24_p8_224_dist.pth
+ - Name: xcit-tiny-12-p8_3rdparty-dist_in1k-384px
+ Metadata:
+ FLOPs: 14126142912
+ Parameters: 6706504
+ Training Data: ImageNet-1k
+ In Collection: XCiT
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 82.46
+ Top 5 Accuracy: 96.22
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/xcit/xcit-tiny-12-p8_3rdparty-dist_in1k-384px_20230213-a072174a.pth
+ Config: configs/xcit/xcit-tiny-12-p8_8xb128_in1k-384px.py
+ Converted From:
+ Code: https://github.com/facebookresearch/xcit
+ Weights: https://dl.fbaipublicfiles.com/xcit/xcit_tiny_12_p8_384_dist.pth
+ - Name: xcit-small-12-p16_3rdparty-dist_in1k-384px
+ Metadata:
+ FLOPs: 14143179648
+ Parameters: 26253304
+ Training Data: ImageNet-1k
+ In Collection: XCiT
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 84.74
+ Top 5 Accuracy: 97.19
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/xcit/xcit-small-12-p16_3rdparty-dist_in1k-384px_20230213-ba36c982.pth
+ Config: configs/xcit/xcit-small-12-p16_8xb128_in1k-384px.py
+ Converted From:
+ Code: https://github.com/facebookresearch/xcit
+ Weights: https://dl.fbaipublicfiles.com/xcit/xcit_small_12_p16_384_dist.pth
+ - Name: xcit-medium-24-p16_3rdparty_in1k
+ Metadata:
+ FLOPs: 16129561088
+ Parameters: 84395752
+ Training Data: ImageNet-1k
+ In Collection: XCiT
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 82.56
+ Top 5 Accuracy: 95.82
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/xcit/xcit-medium-24-p16_3rdparty_in1k_20230213-ad0aa92e.pth
+ Config: configs/xcit/xcit-medium-24-p16_8xb128_in1k.py
+ Converted From:
+ Code: https://github.com/facebookresearch/xcit
+ Weights: https://dl.fbaipublicfiles.com/xcit/xcit_medium_24_p16_224.pth
+ - Name: xcit-medium-24-p16_3rdparty-dist_in1k
+ Metadata:
+ FLOPs: 16129561088
+ Parameters: 84395752
+ Training Data: ImageNet-1k
+ In Collection: XCiT
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 84.15
+ Top 5 Accuracy: 96.82
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/xcit/xcit-medium-24-p16_3rdparty-dist_in1k_20230213-aca5cd0c.pth
+ Config: configs/xcit/xcit-medium-24-p16_8xb128_in1k.py
+ Converted From:
+ Code: https://github.com/facebookresearch/xcit
+ Weights: https://dl.fbaipublicfiles.com/xcit/xcit_medium_24_p16_224_dist.pth
+ - Name: xcit-small-12-p8_3rdparty_in1k
+ Metadata:
+ FLOPs: 18691601280
+ Parameters: 26213032
+ Training Data: ImageNet-1k
+ In Collection: XCiT
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 83.21
+ Top 5 Accuracy: 96.41
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/xcit/xcit-small-12-p8_3rdparty_in1k_20230213-9e364ce3.pth
+ Config: configs/xcit/xcit-small-12-p8_8xb128_in1k.py
+ Converted From:
+ Code: https://github.com/facebookresearch/xcit
+ Weights: https://dl.fbaipublicfiles.com/xcit/xcit_small_12_p8_224.pth
+ - Name: xcit-small-12-p8_3rdparty-dist_in1k
+ Metadata:
+ FLOPs: 18691601280
+ Parameters: 26213032
+ Training Data: ImageNet-1k
+ In Collection: XCiT
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 83.97
+ Top 5 Accuracy: 96.81
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/xcit/xcit-small-12-p8_3rdparty-dist_in1k_20230213-71886580.pth
+ Config: configs/xcit/xcit-small-12-p8_8xb128_in1k.py
+ Converted From:
+ Code: https://github.com/facebookresearch/xcit
+ Weights: https://dl.fbaipublicfiles.com/xcit/xcit_small_12_p8_224_dist.pth
+ - Name: xcit-small-24-p16_3rdparty-dist_in1k-384px
+ Metadata:
+ FLOPs: 26721471360
+ Parameters: 47671384
+ Training Data: ImageNet-1k
+ In Collection: XCiT
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 85.1
+ Top 5 Accuracy: 97.32
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/xcit/xcit-small-24-p16_3rdparty-dist_in1k-384px_20230213-28fa2d0e.pth
+ Config: configs/xcit/xcit-small-24-p16_8xb128_in1k-384px.py
+ Converted From:
+ Code: https://github.com/facebookresearch/xcit
+ Weights: https://dl.fbaipublicfiles.com/xcit/xcit_small_24_p16_384_dist.pth
+ - Name: xcit-tiny-24-p8_3rdparty-dist_in1k-384px
+ Metadata:
+ FLOPs: 27052135872
+ Parameters: 12107128
+ Training Data: ImageNet-1k
+ In Collection: XCiT
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 83.77
+ Top 5 Accuracy: 96.72
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/xcit/xcit-tiny-24-p8_3rdparty-dist_in1k-384px_20230213-30d5e5ec.pth
+ Config: configs/xcit/xcit-tiny-24-p8_8xb128_in1k-384px.py
+ Converted From:
+ Code: https://github.com/facebookresearch/xcit
+ Weights: https://dl.fbaipublicfiles.com/xcit/xcit_tiny_24_p8_384_dist.pth
+ - Name: xcit-small-24-p8_3rdparty_in1k
+ Metadata:
+ FLOPs: 35812053888
+ Parameters: 47631112
+ Training Data: ImageNet-1k
+ In Collection: XCiT
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 83.62
+ Top 5 Accuracy: 96.51
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/xcit/xcit-small-24-p8_3rdparty_in1k_20230213-280ebcc7.pth
+ Config: configs/xcit/xcit-small-24-p8_8xb128_in1k.py
+ Converted From:
+ Code: https://github.com/facebookresearch/xcit
+ Weights: https://dl.fbaipublicfiles.com/xcit/xcit_small_24_p8_224.pth
+ - Name: xcit-small-24-p8_3rdparty-dist_in1k
+ Metadata:
+ FLOPs: 35812053888
+ Parameters: 47631112
+ Training Data: ImageNet-1k
+ In Collection: XCiT
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 84.68
+ Top 5 Accuracy: 97.07
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/xcit/xcit-small-24-p8_3rdparty-dist_in1k_20230213-f2773c78.pth
+ Config: configs/xcit/xcit-small-24-p8_8xb128_in1k.py
+ Converted From:
+ Code: https://github.com/facebookresearch/xcit
+ Weights: https://dl.fbaipublicfiles.com/xcit/xcit_small_24_p8_224_dist.pth
+ - Name: xcit-large-24-p16_3rdparty_in1k
+ Metadata:
+ FLOPs: 35855948544
+ Parameters: 189096136
+ Training Data: ImageNet-1k
+ In Collection: XCiT
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 82.97
+ Top 5 Accuracy: 95.86
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/xcit/xcit-large-24-p16_3rdparty_in1k_20230214-d29d2529.pth
+ Config: configs/xcit/xcit-large-24-p16_8xb128_in1k.py
+ Converted From:
+ Code: https://github.com/facebookresearch/xcit
+ Weights: https://dl.fbaipublicfiles.com/xcit/xcit_large_24_p16_224.pth
+ - Name: xcit-large-24-p16_3rdparty-dist_in1k
+ Metadata:
+ FLOPs: 35855948544
+ Parameters: 189096136
+ Training Data: ImageNet-1k
+ In Collection: XCiT
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 84.61
+ Top 5 Accuracy: 97.07
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/xcit/xcit-large-24-p16_3rdparty-dist_in1k_20230214-4fea599c.pth
+ Config: configs/xcit/xcit-large-24-p16_8xb128_in1k.py
+ Converted From:
+ Code: https://github.com/facebookresearch/xcit
+ Weights: https://dl.fbaipublicfiles.com/xcit/xcit_large_24_p16_224_dist.pth
+ - Name: xcit-medium-24-p16_3rdparty-dist_in1k-384px
+ Metadata:
+ FLOPs: 47388932608
+ Parameters: 84395752
+ Training Data: ImageNet-1k
+ In Collection: XCiT
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 85.47
+ Top 5 Accuracy: 97.49
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/xcit/xcit-medium-24-p16_3rdparty-dist_in1k-384px_20230214-6c23a201.pth
+ Config: configs/xcit/xcit-medium-24-p16_8xb128_in1k-384px.py
+ Converted From:
+ Code: https://github.com/facebookresearch/xcit
+ Weights: https://dl.fbaipublicfiles.com/xcit/xcit_medium_24_p16_384_dist.pth
+ - Name: xcit-small-12-p8_3rdparty-dist_in1k-384px
+ Metadata:
+ FLOPs: 54923537280
+ Parameters: 26213032
+ Training Data: ImageNet-1k
+ In Collection: XCiT
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 85.12
+ Top 5 Accuracy: 97.31
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/xcit/xcit-small-12-p8_3rdparty-dist_in1k-384px_20230214-9f2178bc.pth
+ Config: configs/xcit/xcit-small-12-p8_8xb128_in1k-384px.py
+ Converted From:
+ Code: https://github.com/facebookresearch/xcit
+ Weights: https://dl.fbaipublicfiles.com/xcit/xcit_small_12_p8_384_dist.pth
+ - Name: xcit-medium-24-p8_3rdparty_in1k
+ Metadata:
+ FLOPs: 63524706816
+ Parameters: 84323624
+ Training Data: ImageNet-1k
+ In Collection: XCiT
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 83.61
+ Top 5 Accuracy: 96.23
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/xcit/xcit-medium-24-p8_3rdparty_in1k_20230214-c362850b.pth
+ Config: configs/xcit/xcit-medium-24-p8_8xb128_in1k.py
+ Converted From:
+ Code: https://github.com/facebookresearch/xcit
+ Weights: https://dl.fbaipublicfiles.com/xcit/xcit_medium_24_p8_224.pth
+ - Name: xcit-medium-24-p8_3rdparty-dist_in1k
+ Metadata:
+ FLOPs: 63524706816
+ Parameters: 84323624
+ Training Data: ImageNet-1k
+ In Collection: XCiT
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 85.0
+ Top 5 Accuracy: 97.16
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/xcit/xcit-medium-24-p8_3rdparty-dist_in1k_20230214-625c953b.pth
+ Config: configs/xcit/xcit-medium-24-p8_8xb128_in1k.py
+ Converted From:
+ Code: https://github.com/facebookresearch/xcit
+ Weights: https://dl.fbaipublicfiles.com/xcit/xcit_medium_24_p8_224_dist.pth
+ - Name: xcit-small-24-p8_3rdparty-dist_in1k-384px
+ Metadata:
+ FLOPs: 105236704128
+ Parameters: 47631112
+ Training Data: ImageNet-1k
+ In Collection: XCiT
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 85.57
+ Top 5 Accuracy: 97.6
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/xcit/xcit-small-24-p8_3rdparty-dist_in1k-384px_20230214-57298eca.pth
+ Config: configs/xcit/xcit-small-24-p8_8xb128_in1k-384px.py
+ Converted From:
+ Code: https://github.com/facebookresearch/xcit
+ Weights: https://dl.fbaipublicfiles.com/xcit/xcit_small_24_p8_384_dist.pth
+ - Name: xcit-large-24-p16_3rdparty-dist_in1k-384px
+ Metadata:
+ FLOPs: 105345095424
+ Parameters: 189096136
+ Training Data: ImageNet-1k
+ In Collection: XCiT
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 85.78
+ Top 5 Accuracy: 97.6
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/xcit/xcit-large-24-p16_3rdparty-dist_in1k-384px_20230214-bd515a34.pth
+ Config: configs/xcit/xcit-large-24-p16_8xb128_in1k-384px.py
+ Converted From:
+ Code: https://github.com/facebookresearch/xcit
+ Weights: https://dl.fbaipublicfiles.com/xcit/xcit_large_24_p16_384_dist.pth
+ - Name: xcit-large-24-p8_3rdparty_in1k
+ Metadata:
+ FLOPs: 141225699072
+ Parameters: 188932648
+ Training Data: ImageNet-1k
+ In Collection: XCiT
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 84.23
+ Top 5 Accuracy: 96.58
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/xcit/xcit-large-24-p8_3rdparty_in1k_20230214-08f2f664.pth
+ Config: configs/xcit/xcit-large-24-p8_8xb128_in1k.py
+ Converted From:
+ Code: https://github.com/facebookresearch/xcit
+ Weights: https://dl.fbaipublicfiles.com/xcit/xcit_large_24_p8_224.pth
+ - Name: xcit-large-24-p8_3rdparty-dist_in1k
+ Metadata:
+ FLOPs: 141225699072
+ Parameters: 188932648
+ Training Data: ImageNet-1k
+ In Collection: XCiT
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 85.14
+ Top 5 Accuracy: 97.32
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/xcit/xcit-large-24-p8_3rdparty-dist_in1k_20230214-8c092b34.pth
+ Config: configs/xcit/xcit-large-24-p8_8xb128_in1k.py
+ Converted From:
+ Code: https://github.com/facebookresearch/xcit
+ Weights: https://dl.fbaipublicfiles.com/xcit/xcit_large_24_p8_224_dist.pth
+ - Name: xcit-medium-24-p8_3rdparty-dist_in1k-384px
+ Metadata:
+ FLOPs: 186672626176
+ Parameters: 84323624
+ Training Data: ImageNet-1k
+ In Collection: XCiT
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 85.87
+ Top 5 Accuracy: 97.61
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/xcit/xcit-medium-24-p8_3rdparty-dist_in1k-384px_20230214-5db925e0.pth
+ Config: configs/xcit/xcit-medium-24-p8_8xb128_in1k-384px.py
+ Converted From:
+ Code: https://github.com/facebookresearch/xcit
+ Weights: https://dl.fbaipublicfiles.com/xcit/xcit_medium_24_p8_384_dist.pth
+ - Name: xcit-large-24-p8_3rdparty-dist_in1k-384px
+ Metadata:
+ FLOPs: 415003137792
+ Parameters: 188932648
+ Training Data: ImageNet-1k
+ In Collection: XCiT
+ Results:
+ - Dataset: ImageNet-1k
+ Metrics:
+ Top 1 Accuracy: 86.13
+ Top 5 Accuracy: 97.75
+ Task: Image Classification
+ Weights: https://download.openmmlab.com/mmclassification/v0/xcit/xcit-large-24-p8_3rdparty-dist_in1k-384px_20230214-9f718b1a.pth
+ Config: configs/xcit/xcit-large-24-p8_8xb128_in1k-384px.py
+ Converted From:
+ Code: https://github.com/facebookresearch/xcit
+ Weights: https://dl.fbaipublicfiles.com/xcit/xcit_large_24_p8_384_dist.pth
diff --git a/configs/xcit/xcit-large-24-p16_8xb128_in1k-384px.py b/configs/xcit/xcit-large-24-p16_8xb128_in1k-384px.py
new file mode 100644
index 0000000000000000000000000000000000000000..b393c4aea03ab1927e11773609562cd323963931
--- /dev/null
+++ b/configs/xcit/xcit-large-24-p16_8xb128_in1k-384px.py
@@ -0,0 +1,34 @@
+_base_ = [
+ '../_base_/datasets/imagenet_bs64_swin_384.py',
+ '../_base_/schedules/imagenet_bs1024_adamw_swin.py',
+ '../_base_/default_runtime.py',
+]
+
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='XCiT',
+ patch_size=16,
+ embed_dims=768,
+ depth=24,
+ num_heads=16,
+ mlp_ratio=4,
+ qkv_bias=True,
+ layer_scale_init_value=1e-5,
+ tokens_norm=True,
+ out_type='cls_token',
+ ),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=768,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ ),
+ train_cfg=dict(augments=[
+ dict(type='Mixup', alpha=0.8),
+ dict(type='CutMix', alpha=1.0),
+ ]),
+)
+
+# dataset settings
+train_dataloader = dict(batch_size=128)
diff --git a/configs/xcit/xcit-large-24-p16_8xb128_in1k.py b/configs/xcit/xcit-large-24-p16_8xb128_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..b5c01cb5f72e93ad8b5e81d363b3c3f914504f64
--- /dev/null
+++ b/configs/xcit/xcit-large-24-p16_8xb128_in1k.py
@@ -0,0 +1,34 @@
+_base_ = [
+ '../_base_/datasets/imagenet_bs64_swin_224.py',
+ '../_base_/schedules/imagenet_bs1024_adamw_swin.py',
+ '../_base_/default_runtime.py',
+]
+
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='XCiT',
+ patch_size=16,
+ embed_dims=768,
+ depth=24,
+ num_heads=16,
+ mlp_ratio=4,
+ qkv_bias=True,
+ layer_scale_init_value=1e-5,
+ tokens_norm=True,
+ out_type='cls_token',
+ ),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=768,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ ),
+ train_cfg=dict(augments=[
+ dict(type='Mixup', alpha=0.8),
+ dict(type='CutMix', alpha=1.0),
+ ]),
+)
+
+# dataset settings
+train_dataloader = dict(batch_size=128)
diff --git a/configs/xcit/xcit-large-24-p8_8xb128_in1k-384px.py b/configs/xcit/xcit-large-24-p8_8xb128_in1k-384px.py
new file mode 100644
index 0000000000000000000000000000000000000000..46b8422b481e69100266798a2183cae56d6e345e
--- /dev/null
+++ b/configs/xcit/xcit-large-24-p8_8xb128_in1k-384px.py
@@ -0,0 +1,34 @@
+_base_ = [
+ '../_base_/datasets/imagenet_bs64_swin_384.py',
+ '../_base_/schedules/imagenet_bs1024_adamw_swin.py',
+ '../_base_/default_runtime.py',
+]
+
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='XCiT',
+ patch_size=8,
+ embed_dims=768,
+ depth=24,
+ num_heads=16,
+ mlp_ratio=4,
+ qkv_bias=True,
+ layer_scale_init_value=1e-5,
+ tokens_norm=True,
+ out_type='cls_token',
+ ),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=768,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ ),
+ train_cfg=dict(augments=[
+ dict(type='Mixup', alpha=0.8),
+ dict(type='CutMix', alpha=1.0),
+ ]),
+)
+
+# dataset settings
+train_dataloader = dict(batch_size=128)
diff --git a/configs/xcit/xcit-large-24-p8_8xb128_in1k.py b/configs/xcit/xcit-large-24-p8_8xb128_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..6dc67baa59b9e270b2c06bb0a928879ef8f78f60
--- /dev/null
+++ b/configs/xcit/xcit-large-24-p8_8xb128_in1k.py
@@ -0,0 +1,34 @@
+_base_ = [
+ '../_base_/datasets/imagenet_bs64_swin_224.py',
+ '../_base_/schedules/imagenet_bs1024_adamw_swin.py',
+ '../_base_/default_runtime.py',
+]
+
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='XCiT',
+ patch_size=8,
+ embed_dims=768,
+ depth=24,
+ num_heads=16,
+ mlp_ratio=4,
+ qkv_bias=True,
+ layer_scale_init_value=1e-5,
+ tokens_norm=True,
+ out_type='cls_token',
+ ),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=768,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ ),
+ train_cfg=dict(augments=[
+ dict(type='Mixup', alpha=0.8),
+ dict(type='CutMix', alpha=1.0),
+ ]),
+)
+
+# dataset settings
+train_dataloader = dict(batch_size=128)
diff --git a/configs/xcit/xcit-medium-24-p16_8xb128_in1k-384px.py b/configs/xcit/xcit-medium-24-p16_8xb128_in1k-384px.py
new file mode 100644
index 0000000000000000000000000000000000000000..8c91b9cd6e9511a8dbbae437a5454d35eb4c03e0
--- /dev/null
+++ b/configs/xcit/xcit-medium-24-p16_8xb128_in1k-384px.py
@@ -0,0 +1,34 @@
+_base_ = [
+ '../_base_/datasets/imagenet_bs64_swin_384.py',
+ '../_base_/schedules/imagenet_bs1024_adamw_swin.py',
+ '../_base_/default_runtime.py',
+]
+
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='XCiT',
+ patch_size=16,
+ embed_dims=512,
+ depth=24,
+ num_heads=8,
+ mlp_ratio=4,
+ qkv_bias=True,
+ layer_scale_init_value=1e-5,
+ tokens_norm=True,
+ out_type='cls_token',
+ ),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=512,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ ),
+ train_cfg=dict(augments=[
+ dict(type='Mixup', alpha=0.8),
+ dict(type='CutMix', alpha=1.0),
+ ]),
+)
+
+# dataset settings
+train_dataloader = dict(batch_size=128)
diff --git a/configs/xcit/xcit-medium-24-p16_8xb128_in1k.py b/configs/xcit/xcit-medium-24-p16_8xb128_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..148ed0640da548877cbf04c67bfc0bbb3351dfce
--- /dev/null
+++ b/configs/xcit/xcit-medium-24-p16_8xb128_in1k.py
@@ -0,0 +1,34 @@
+_base_ = [
+ '../_base_/datasets/imagenet_bs64_swin_224.py',
+ '../_base_/schedules/imagenet_bs1024_adamw_swin.py',
+ '../_base_/default_runtime.py',
+]
+
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='XCiT',
+ patch_size=16,
+ embed_dims=512,
+ depth=24,
+ num_heads=8,
+ mlp_ratio=4,
+ qkv_bias=True,
+ layer_scale_init_value=1e-5,
+ tokens_norm=True,
+ out_type='cls_token',
+ ),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=512,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ ),
+ train_cfg=dict(augments=[
+ dict(type='Mixup', alpha=0.8),
+ dict(type='CutMix', alpha=1.0),
+ ]),
+)
+
+# dataset settings
+train_dataloader = dict(batch_size=128)
diff --git a/configs/xcit/xcit-medium-24-p8_8xb128_in1k-384px.py b/configs/xcit/xcit-medium-24-p8_8xb128_in1k-384px.py
new file mode 100644
index 0000000000000000000000000000000000000000..3138ec4f0b41456d99e2d59d60575327e794f10e
--- /dev/null
+++ b/configs/xcit/xcit-medium-24-p8_8xb128_in1k-384px.py
@@ -0,0 +1,34 @@
+_base_ = [
+ '../_base_/datasets/imagenet_bs64_swin_384.py',
+ '../_base_/schedules/imagenet_bs1024_adamw_swin.py',
+ '../_base_/default_runtime.py',
+]
+
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='XCiT',
+ patch_size=8,
+ embed_dims=512,
+ depth=24,
+ num_heads=8,
+ mlp_ratio=4,
+ qkv_bias=True,
+ layer_scale_init_value=1e-5,
+ tokens_norm=True,
+ out_type='cls_token',
+ ),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=512,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ ),
+ train_cfg=dict(augments=[
+ dict(type='Mixup', alpha=0.8),
+ dict(type='CutMix', alpha=1.0),
+ ]),
+)
+
+# dataset settings
+train_dataloader = dict(batch_size=128)
diff --git a/configs/xcit/xcit-medium-24-p8_8xb128_in1k.py b/configs/xcit/xcit-medium-24-p8_8xb128_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..b8277a10b772aa3c7a39ace2051829c8818df987
--- /dev/null
+++ b/configs/xcit/xcit-medium-24-p8_8xb128_in1k.py
@@ -0,0 +1,34 @@
+_base_ = [
+ '../_base_/datasets/imagenet_bs64_swin_224.py',
+ '../_base_/schedules/imagenet_bs1024_adamw_swin.py',
+ '../_base_/default_runtime.py',
+]
+
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='XCiT',
+ patch_size=8,
+ embed_dims=512,
+ depth=24,
+ num_heads=8,
+ mlp_ratio=4,
+ qkv_bias=True,
+ layer_scale_init_value=1e-5,
+ tokens_norm=True,
+ out_type='cls_token',
+ ),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=512,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ ),
+ train_cfg=dict(augments=[
+ dict(type='Mixup', alpha=0.8),
+ dict(type='CutMix', alpha=1.0),
+ ]),
+)
+
+# dataset settings
+train_dataloader = dict(batch_size=128)
diff --git a/configs/xcit/xcit-nano-12-p16_8xb128_in1k-384px.py b/configs/xcit/xcit-nano-12-p16_8xb128_in1k-384px.py
new file mode 100644
index 0000000000000000000000000000000000000000..bf8c27b3b1acee69892fa83a8be40da82b62fd44
--- /dev/null
+++ b/configs/xcit/xcit-nano-12-p16_8xb128_in1k-384px.py
@@ -0,0 +1,34 @@
+_base_ = [
+ '../_base_/datasets/imagenet_bs64_swin_384.py',
+ '../_base_/schedules/imagenet_bs1024_adamw_swin.py',
+ '../_base_/default_runtime.py',
+]
+
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='XCiT',
+ patch_size=16,
+ embed_dims=128,
+ depth=12,
+ num_heads=4,
+ mlp_ratio=4,
+ qkv_bias=True,
+ layer_scale_init_value=1.0,
+ tokens_norm=False,
+ out_type='cls_token',
+ ),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=128,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ ),
+ train_cfg=dict(augments=[
+ dict(type='Mixup', alpha=0.8),
+ dict(type='CutMix', alpha=1.0),
+ ]),
+)
+
+# dataset settings
+train_dataloader = dict(batch_size=128)
diff --git a/configs/xcit/xcit-nano-12-p16_8xb128_in1k.py b/configs/xcit/xcit-nano-12-p16_8xb128_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..3e9bf81c5f4639ee5c7ba57c9ef996c79076df65
--- /dev/null
+++ b/configs/xcit/xcit-nano-12-p16_8xb128_in1k.py
@@ -0,0 +1,34 @@
+_base_ = [
+ '../_base_/datasets/imagenet_bs64_swin_224.py',
+ '../_base_/schedules/imagenet_bs1024_adamw_swin.py',
+ '../_base_/default_runtime.py',
+]
+
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='XCiT',
+ patch_size=16,
+ embed_dims=128,
+ depth=12,
+ num_heads=4,
+ mlp_ratio=4,
+ qkv_bias=True,
+ layer_scale_init_value=1.0,
+ tokens_norm=False,
+ out_type='cls_token',
+ ),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=128,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ ),
+ train_cfg=dict(augments=[
+ dict(type='Mixup', alpha=0.8),
+ dict(type='CutMix', alpha=1.0),
+ ]),
+)
+
+# dataset settings
+train_dataloader = dict(batch_size=128)
diff --git a/configs/xcit/xcit-nano-12-p8_8xb128_in1k-384px.py b/configs/xcit/xcit-nano-12-p8_8xb128_in1k-384px.py
new file mode 100644
index 0000000000000000000000000000000000000000..7dae69f0b3b9a2ea8792f0beed8e0ee68f0cc4e9
--- /dev/null
+++ b/configs/xcit/xcit-nano-12-p8_8xb128_in1k-384px.py
@@ -0,0 +1,34 @@
+_base_ = [
+ '../_base_/datasets/imagenet_bs64_swin_384.py',
+ '../_base_/schedules/imagenet_bs1024_adamw_swin.py',
+ '../_base_/default_runtime.py',
+]
+
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='XCiT',
+ patch_size=8,
+ embed_dims=128,
+ depth=12,
+ num_heads=4,
+ mlp_ratio=4,
+ qkv_bias=True,
+ layer_scale_init_value=1.0,
+ tokens_norm=False,
+ out_type='cls_token',
+ ),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=128,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ ),
+ train_cfg=dict(augments=[
+ dict(type='Mixup', alpha=0.8),
+ dict(type='CutMix', alpha=1.0),
+ ]),
+)
+
+# dataset settings
+train_dataloader = dict(batch_size=128)
diff --git a/configs/xcit/xcit-nano-12-p8_8xb128_in1k.py b/configs/xcit/xcit-nano-12-p8_8xb128_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..e6a003a30ef7348f29732ca1c36210704e886c1c
--- /dev/null
+++ b/configs/xcit/xcit-nano-12-p8_8xb128_in1k.py
@@ -0,0 +1,34 @@
+_base_ = [
+ '../_base_/datasets/imagenet_bs64_swin_224.py',
+ '../_base_/schedules/imagenet_bs1024_adamw_swin.py',
+ '../_base_/default_runtime.py',
+]
+
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='XCiT',
+ patch_size=8,
+ embed_dims=128,
+ depth=12,
+ num_heads=4,
+ mlp_ratio=4,
+ qkv_bias=True,
+ layer_scale_init_value=1.0,
+ tokens_norm=False,
+ out_type='cls_token',
+ ),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=128,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ ),
+ train_cfg=dict(augments=[
+ dict(type='Mixup', alpha=0.8),
+ dict(type='CutMix', alpha=1.0),
+ ]),
+)
+
+# dataset settings
+train_dataloader = dict(batch_size=128)
diff --git a/configs/xcit/xcit-small-12-p16_8xb128_in1k-384px.py b/configs/xcit/xcit-small-12-p16_8xb128_in1k-384px.py
new file mode 100644
index 0000000000000000000000000000000000000000..54c80d498e0c1370f1122ee34ef1970a521796a7
--- /dev/null
+++ b/configs/xcit/xcit-small-12-p16_8xb128_in1k-384px.py
@@ -0,0 +1,34 @@
+_base_ = [
+ '../_base_/datasets/imagenet_bs64_swin_384.py',
+ '../_base_/schedules/imagenet_bs1024_adamw_swin.py',
+ '../_base_/default_runtime.py',
+]
+
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='XCiT',
+ patch_size=16,
+ embed_dims=384,
+ depth=12,
+ num_heads=8,
+ mlp_ratio=4,
+ qkv_bias=True,
+ layer_scale_init_value=1.0,
+ tokens_norm=True,
+ out_type='cls_token',
+ ),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=384,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ ),
+ train_cfg=dict(augments=[
+ dict(type='Mixup', alpha=0.8),
+ dict(type='CutMix', alpha=1.0),
+ ]),
+)
+
+# dataset settings
+train_dataloader = dict(batch_size=128)
diff --git a/configs/xcit/xcit-small-12-p16_8xb128_in1k.py b/configs/xcit/xcit-small-12-p16_8xb128_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..c546179f42f7a0a668d3d7f8d27ae137006577ae
--- /dev/null
+++ b/configs/xcit/xcit-small-12-p16_8xb128_in1k.py
@@ -0,0 +1,34 @@
+_base_ = [
+ '../_base_/datasets/imagenet_bs64_swin_224.py',
+ '../_base_/schedules/imagenet_bs1024_adamw_swin.py',
+ '../_base_/default_runtime.py',
+]
+
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='XCiT',
+ patch_size=16,
+ embed_dims=384,
+ depth=12,
+ num_heads=8,
+ mlp_ratio=4,
+ qkv_bias=True,
+ layer_scale_init_value=1.0,
+ tokens_norm=True,
+ out_type='cls_token',
+ ),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=384,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ ),
+ train_cfg=dict(augments=[
+ dict(type='Mixup', alpha=0.8),
+ dict(type='CutMix', alpha=1.0),
+ ]),
+)
+
+# dataset settings
+train_dataloader = dict(batch_size=128)
diff --git a/configs/xcit/xcit-small-12-p8_8xb128_in1k-384px.py b/configs/xcit/xcit-small-12-p8_8xb128_in1k-384px.py
new file mode 100644
index 0000000000000000000000000000000000000000..f1b6a52c370578f9fe9420521d1bc494563071e6
--- /dev/null
+++ b/configs/xcit/xcit-small-12-p8_8xb128_in1k-384px.py
@@ -0,0 +1,34 @@
+_base_ = [
+ '../_base_/datasets/imagenet_bs64_swin_384.py',
+ '../_base_/schedules/imagenet_bs1024_adamw_swin.py',
+ '../_base_/default_runtime.py',
+]
+
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='XCiT',
+ patch_size=8,
+ embed_dims=384,
+ depth=12,
+ num_heads=8,
+ mlp_ratio=4,
+ qkv_bias=True,
+ layer_scale_init_value=1.0,
+ tokens_norm=True,
+ out_type='cls_token',
+ ),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=384,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ ),
+ train_cfg=dict(augments=[
+ dict(type='Mixup', alpha=0.8),
+ dict(type='CutMix', alpha=1.0),
+ ]),
+)
+
+# dataset settings
+train_dataloader = dict(batch_size=128)
diff --git a/configs/xcit/xcit-small-12-p8_8xb128_in1k.py b/configs/xcit/xcit-small-12-p8_8xb128_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..cbfbe151781fb012fae2099bb0a9b9bd5d7e563e
--- /dev/null
+++ b/configs/xcit/xcit-small-12-p8_8xb128_in1k.py
@@ -0,0 +1,34 @@
+_base_ = [
+ '../_base_/datasets/imagenet_bs64_swin_224.py',
+ '../_base_/schedules/imagenet_bs1024_adamw_swin.py',
+ '../_base_/default_runtime.py',
+]
+
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='XCiT',
+ patch_size=8,
+ embed_dims=384,
+ depth=12,
+ num_heads=8,
+ mlp_ratio=4,
+ qkv_bias=True,
+ layer_scale_init_value=1.0,
+ tokens_norm=True,
+ out_type='cls_token',
+ ),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=384,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ ),
+ train_cfg=dict(augments=[
+ dict(type='Mixup', alpha=0.8),
+ dict(type='CutMix', alpha=1.0),
+ ]),
+)
+
+# dataset settings
+train_dataloader = dict(batch_size=128)
diff --git a/configs/xcit/xcit-small-24-p16_8xb128_in1k-384px.py b/configs/xcit/xcit-small-24-p16_8xb128_in1k-384px.py
new file mode 100644
index 0000000000000000000000000000000000000000..6eb41275b83939e2ac71f5e6e15fa2a8bf5f4df2
--- /dev/null
+++ b/configs/xcit/xcit-small-24-p16_8xb128_in1k-384px.py
@@ -0,0 +1,34 @@
+_base_ = [
+ '../_base_/datasets/imagenet_bs64_swin_384.py',
+ '../_base_/schedules/imagenet_bs1024_adamw_swin.py',
+ '../_base_/default_runtime.py',
+]
+
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='XCiT',
+ patch_size=16,
+ embed_dims=384,
+ depth=24,
+ num_heads=8,
+ mlp_ratio=4,
+ qkv_bias=True,
+ layer_scale_init_value=1e-5,
+ tokens_norm=True,
+ out_type='cls_token',
+ ),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=384,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ ),
+ train_cfg=dict(augments=[
+ dict(type='Mixup', alpha=0.8),
+ dict(type='CutMix', alpha=1.0),
+ ]),
+)
+
+# dataset settings
+train_dataloader = dict(batch_size=128)
diff --git a/configs/xcit/xcit-small-24-p16_8xb128_in1k.py b/configs/xcit/xcit-small-24-p16_8xb128_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..5b3dc18f438ffb49bde71a795e24abf36c427e14
--- /dev/null
+++ b/configs/xcit/xcit-small-24-p16_8xb128_in1k.py
@@ -0,0 +1,34 @@
+_base_ = [
+ '../_base_/datasets/imagenet_bs64_swin_224.py',
+ '../_base_/schedules/imagenet_bs1024_adamw_swin.py',
+ '../_base_/default_runtime.py',
+]
+
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='XCiT',
+ patch_size=16,
+ embed_dims=384,
+ depth=24,
+ num_heads=8,
+ mlp_ratio=4,
+ qkv_bias=True,
+ layer_scale_init_value=1e-5,
+ tokens_norm=True,
+ out_type='cls_token',
+ ),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=384,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ ),
+ train_cfg=dict(augments=[
+ dict(type='Mixup', alpha=0.8),
+ dict(type='CutMix', alpha=1.0),
+ ]),
+)
+
+# dataset settings
+train_dataloader = dict(batch_size=128)
diff --git a/configs/xcit/xcit-small-24-p8_8xb128_in1k-384px.py b/configs/xcit/xcit-small-24-p8_8xb128_in1k-384px.py
new file mode 100644
index 0000000000000000000000000000000000000000..34445a09d637c222a25aa608de2f99bf1dacedb1
--- /dev/null
+++ b/configs/xcit/xcit-small-24-p8_8xb128_in1k-384px.py
@@ -0,0 +1,34 @@
+_base_ = [
+ '../_base_/datasets/imagenet_bs64_swin_384.py',
+ '../_base_/schedules/imagenet_bs1024_adamw_swin.py',
+ '../_base_/default_runtime.py',
+]
+
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='XCiT',
+ patch_size=8,
+ embed_dims=384,
+ depth=24,
+ num_heads=8,
+ mlp_ratio=4,
+ qkv_bias=True,
+ layer_scale_init_value=1e-5,
+ tokens_norm=True,
+ out_type='cls_token',
+ ),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=384,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ ),
+ train_cfg=dict(augments=[
+ dict(type='Mixup', alpha=0.8),
+ dict(type='CutMix', alpha=1.0),
+ ]),
+)
+
+# dataset settings
+train_dataloader = dict(batch_size=128)
diff --git a/configs/xcit/xcit-small-24-p8_8xb128_in1k.py b/configs/xcit/xcit-small-24-p8_8xb128_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..108e64d41ae0c34c17bc5e6a5baa6d46eb6a9d08
--- /dev/null
+++ b/configs/xcit/xcit-small-24-p8_8xb128_in1k.py
@@ -0,0 +1,34 @@
+_base_ = [
+ '../_base_/datasets/imagenet_bs64_swin_224.py',
+ '../_base_/schedules/imagenet_bs1024_adamw_swin.py',
+ '../_base_/default_runtime.py',
+]
+
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='XCiT',
+ patch_size=8,
+ embed_dims=384,
+ depth=24,
+ num_heads=8,
+ mlp_ratio=4,
+ qkv_bias=True,
+ layer_scale_init_value=1e-5,
+ tokens_norm=True,
+ out_type='cls_token',
+ ),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=384,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ ),
+ train_cfg=dict(augments=[
+ dict(type='Mixup', alpha=0.8),
+ dict(type='CutMix', alpha=1.0),
+ ]),
+)
+
+# dataset settings
+train_dataloader = dict(batch_size=128)
diff --git a/configs/xcit/xcit-tiny-12-p16_8xb128_in1k-384px.py b/configs/xcit/xcit-tiny-12-p16_8xb128_in1k-384px.py
new file mode 100644
index 0000000000000000000000000000000000000000..b64ebe497082ef6f9c4b93ad16e7343f66008e07
--- /dev/null
+++ b/configs/xcit/xcit-tiny-12-p16_8xb128_in1k-384px.py
@@ -0,0 +1,34 @@
+_base_ = [
+ '../_base_/datasets/imagenet_bs64_swin_384.py',
+ '../_base_/schedules/imagenet_bs1024_adamw_swin.py',
+ '../_base_/default_runtime.py',
+]
+
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='XCiT',
+ patch_size=16,
+ embed_dims=192,
+ depth=12,
+ num_heads=4,
+ mlp_ratio=4,
+ qkv_bias=True,
+ layer_scale_init_value=1.0,
+ tokens_norm=True,
+ out_type='cls_token',
+ ),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=192,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ ),
+ train_cfg=dict(augments=[
+ dict(type='Mixup', alpha=0.8),
+ dict(type='CutMix', alpha=1.0),
+ ]),
+)
+
+# dataset settings
+train_dataloader = dict(batch_size=128)
diff --git a/configs/xcit/xcit-tiny-12-p16_8xb128_in1k.py b/configs/xcit/xcit-tiny-12-p16_8xb128_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..1b54592f88bad986e885129bbce9d585fb864206
--- /dev/null
+++ b/configs/xcit/xcit-tiny-12-p16_8xb128_in1k.py
@@ -0,0 +1,34 @@
+_base_ = [
+ '../_base_/datasets/imagenet_bs64_swin_224.py',
+ '../_base_/schedules/imagenet_bs1024_adamw_swin.py',
+ '../_base_/default_runtime.py',
+]
+
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='XCiT',
+ patch_size=16,
+ embed_dims=192,
+ depth=12,
+ num_heads=4,
+ mlp_ratio=4,
+ qkv_bias=True,
+ layer_scale_init_value=1.0,
+ tokens_norm=True,
+ out_type='cls_token',
+ ),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=192,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ ),
+ train_cfg=dict(augments=[
+ dict(type='Mixup', alpha=0.8),
+ dict(type='CutMix', alpha=1.0),
+ ]),
+)
+
+# dataset settings
+train_dataloader = dict(batch_size=128)
diff --git a/configs/xcit/xcit-tiny-12-p8_8xb128_in1k-384px.py b/configs/xcit/xcit-tiny-12-p8_8xb128_in1k-384px.py
new file mode 100644
index 0000000000000000000000000000000000000000..f1acff7ead898fb45c8ab6eac5aa3ed3dd13d939
--- /dev/null
+++ b/configs/xcit/xcit-tiny-12-p8_8xb128_in1k-384px.py
@@ -0,0 +1,34 @@
+_base_ = [
+ '../_base_/datasets/imagenet_bs64_swin_384.py',
+ '../_base_/schedules/imagenet_bs1024_adamw_swin.py',
+ '../_base_/default_runtime.py',
+]
+
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='XCiT',
+ patch_size=8,
+ embed_dims=192,
+ depth=12,
+ num_heads=4,
+ mlp_ratio=4,
+ qkv_bias=True,
+ layer_scale_init_value=1.0,
+ tokens_norm=True,
+ out_type='cls_token',
+ ),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=192,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ ),
+ train_cfg=dict(augments=[
+ dict(type='Mixup', alpha=0.8),
+ dict(type='CutMix', alpha=1.0),
+ ]),
+)
+
+# dataset settings
+train_dataloader = dict(batch_size=128)
diff --git a/configs/xcit/xcit-tiny-12-p8_8xb128_in1k.py b/configs/xcit/xcit-tiny-12-p8_8xb128_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..39d97da21689382d0e6b168fd78f9a74b269e8c1
--- /dev/null
+++ b/configs/xcit/xcit-tiny-12-p8_8xb128_in1k.py
@@ -0,0 +1,34 @@
+_base_ = [
+ '../_base_/datasets/imagenet_bs64_swin_224.py',
+ '../_base_/schedules/imagenet_bs1024_adamw_swin.py',
+ '../_base_/default_runtime.py',
+]
+
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='XCiT',
+ patch_size=8,
+ embed_dims=192,
+ depth=12,
+ num_heads=4,
+ mlp_ratio=4,
+ qkv_bias=True,
+ layer_scale_init_value=1.0,
+ tokens_norm=True,
+ out_type='cls_token',
+ ),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=192,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ ),
+ train_cfg=dict(augments=[
+ dict(type='Mixup', alpha=0.8),
+ dict(type='CutMix', alpha=1.0),
+ ]),
+)
+
+# dataset settings
+train_dataloader = dict(batch_size=128)
diff --git a/configs/xcit/xcit-tiny-24-p16_8xb128_in1k-384px.py b/configs/xcit/xcit-tiny-24-p16_8xb128_in1k-384px.py
new file mode 100644
index 0000000000000000000000000000000000000000..556043565e2e844f77a2a2b62e7ebe71d638590d
--- /dev/null
+++ b/configs/xcit/xcit-tiny-24-p16_8xb128_in1k-384px.py
@@ -0,0 +1,34 @@
+_base_ = [
+ '../_base_/datasets/imagenet_bs64_swin_384.py',
+ '../_base_/schedules/imagenet_bs1024_adamw_swin.py',
+ '../_base_/default_runtime.py',
+]
+
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='XCiT',
+ patch_size=16,
+ embed_dims=192,
+ depth=24,
+ num_heads=4,
+ mlp_ratio=4,
+ qkv_bias=True,
+ layer_scale_init_value=1e-5,
+ tokens_norm=True,
+ out_type='cls_token',
+ ),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=192,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ ),
+ train_cfg=dict(augments=[
+ dict(type='Mixup', alpha=0.8),
+ dict(type='CutMix', alpha=1.0),
+ ]),
+)
+
+# dataset settings
+train_dataloader = dict(batch_size=128)
diff --git a/configs/xcit/xcit-tiny-24-p16_8xb128_in1k.py b/configs/xcit/xcit-tiny-24-p16_8xb128_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..fdceb14323ac89a12d529f7112806fef7e6f9d66
--- /dev/null
+++ b/configs/xcit/xcit-tiny-24-p16_8xb128_in1k.py
@@ -0,0 +1,34 @@
+_base_ = [
+ '../_base_/datasets/imagenet_bs64_swin_224.py',
+ '../_base_/schedules/imagenet_bs1024_adamw_swin.py',
+ '../_base_/default_runtime.py',
+]
+
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='XCiT',
+ patch_size=16,
+ embed_dims=192,
+ depth=24,
+ num_heads=4,
+ mlp_ratio=4,
+ qkv_bias=True,
+ layer_scale_init_value=1e-5,
+ tokens_norm=True,
+ out_type='cls_token',
+ ),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=192,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ ),
+ train_cfg=dict(augments=[
+ dict(type='Mixup', alpha=0.8),
+ dict(type='CutMix', alpha=1.0),
+ ]),
+)
+
+# dataset settings
+train_dataloader = dict(batch_size=128)
diff --git a/configs/xcit/xcit-tiny-24-p8_8xb128_in1k-384px.py b/configs/xcit/xcit-tiny-24-p8_8xb128_in1k-384px.py
new file mode 100644
index 0000000000000000000000000000000000000000..2cee442e5b77481550d479c4f83cb2e9a80e46ae
--- /dev/null
+++ b/configs/xcit/xcit-tiny-24-p8_8xb128_in1k-384px.py
@@ -0,0 +1,34 @@
+_base_ = [
+ '../_base_/datasets/imagenet_bs64_swin_384.py',
+ '../_base_/schedules/imagenet_bs1024_adamw_swin.py',
+ '../_base_/default_runtime.py',
+]
+
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='XCiT',
+ patch_size=8,
+ embed_dims=192,
+ depth=24,
+ num_heads=4,
+ mlp_ratio=4,
+ qkv_bias=True,
+ layer_scale_init_value=1e-5,
+ tokens_norm=True,
+ out_type='cls_token',
+ ),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=192,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ ),
+ train_cfg=dict(augments=[
+ dict(type='Mixup', alpha=0.8),
+ dict(type='CutMix', alpha=1.0),
+ ]),
+)
+
+# dataset settings
+train_dataloader = dict(batch_size=128)
diff --git a/configs/xcit/xcit-tiny-24-p8_8xb128_in1k.py b/configs/xcit/xcit-tiny-24-p8_8xb128_in1k.py
new file mode 100644
index 0000000000000000000000000000000000000000..283f17e61708e9d19e5af09c57d8a937cec2e854
--- /dev/null
+++ b/configs/xcit/xcit-tiny-24-p8_8xb128_in1k.py
@@ -0,0 +1,34 @@
+_base_ = [
+ '../_base_/datasets/imagenet_bs64_swin_224.py',
+ '../_base_/schedules/imagenet_bs1024_adamw_swin.py',
+ '../_base_/default_runtime.py',
+]
+
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='XCiT',
+ patch_size=8,
+ embed_dims=192,
+ depth=24,
+ num_heads=4,
+ mlp_ratio=4,
+ qkv_bias=True,
+ layer_scale_init_value=1e-5,
+ tokens_norm=True,
+ out_type='cls_token',
+ ),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=1000,
+ in_channels=192,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ ),
+ train_cfg=dict(augments=[
+ dict(type='Mixup', alpha=0.8),
+ dict(type='CutMix', alpha=1.0),
+ ]),
+)
+
+# dataset settings
+train_dataloader = dict(batch_size=128)
diff --git a/dataset-index.yml b/dataset-index.yml
new file mode 100644
index 0000000000000000000000000000000000000000..40ca62069295695d896134b60e66b2260066c072
--- /dev/null
+++ b/dataset-index.yml
@@ -0,0 +1,11 @@
+imagenet1k:
+ dataset: OpenDataLab/ImageNet-1K
+ download_root: data
+ data_root: data/imagenet
+ script: tools/dataset_converters/odl_imagenet1k_preprocess.sh
+
+cub:
+ dataset: OpenDataLab/CUB-200-2011
+ download_root: data
+ data_root: data/CUB_200_2011
+ script: tools/dataset_converters/odl_cub_preprocess.sh
diff --git a/demo/bird.JPEG b/demo/bird.JPEG
new file mode 100755
index 0000000000000000000000000000000000000000..9c132a099e87d1c3c1a76dfd9201b03801301eab
Binary files /dev/null and b/demo/bird.JPEG differ
diff --git a/demo/cat-dog.png b/demo/cat-dog.png
new file mode 100644
index 0000000000000000000000000000000000000000..2ddd0fdb2e6c9269a9739d525a8feae05af2ee5f
Binary files /dev/null and b/demo/cat-dog.png differ
diff --git a/demo/demo.JPEG b/demo/demo.JPEG
new file mode 100755
index 0000000000000000000000000000000000000000..fd3a93f59385d6ff632483646e6caee300b56d09
Binary files /dev/null and b/demo/demo.JPEG differ
diff --git a/demo/dog.jpg b/demo/dog.jpg
new file mode 100644
index 0000000000000000000000000000000000000000..c68fb054ad2dd2e5968a866c3140849c84b5484b
Binary files /dev/null and b/demo/dog.jpg differ
diff --git a/demo/image_demo.py b/demo/image_demo.py
new file mode 100644
index 0000000000000000000000000000000000000000..015873506ce86a4af2d68e9df9b50e6afe5ec6bc
--- /dev/null
+++ b/demo/image_demo.py
@@ -0,0 +1,44 @@
+# Copyright (c) OpenMMLab. All rights reserved.
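+"""A simple demo for single-image classification.
+
+Example (the model name can be any checkpoint registered in the model zoo,
+e.g. an XCiT model from this repository):
+
+ python demo/image_demo.py demo/bird.JPEG xcit-small-12-p8_3rdparty-dist_in1k --show
+"""
+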
+from argparse import ArgumentParser
+
+from mmengine.fileio import dump
+from rich import print_json
+
+from mmpretrain.apis import ImageClassificationInferencer
+
+
+def main():
+ parser = ArgumentParser()
+ parser.add_argument('img', help='Image file')
+ parser.add_argument('model', help='Model name or config file path')
+ parser.add_argument('--checkpoint', help='Checkpoint file path.')
+ parser.add_argument(
+ '--show',
+ action='store_true',
+ help='Whether to show the prediction result in a window.')
+ parser.add_argument(
+ '--show-dir',
+ type=str,
+ help='The directory to save the visualization image.')
+ parser.add_argument('--device', help='Device used for inference')
+ args = parser.parse_args()
+
+ # build the model from a config file and a checkpoint file
+ try:
+ pretrained = args.checkpoint or True
+ inferencer = ImageClassificationInferencer(
+ args.model, pretrained=pretrained, device=args.device)
+ except ValueError:
+ raise ValueError(
+ f'Unavailable model "{args.model}", you can specify a model '
+ 'name or a config file, or find a model name from '
+ 'https://mmpretrain.readthedocs.io/en/latest/modelzoo_statistics.html#all-checkpoints' # noqa: E501
+ )
+ result = inferencer(args.img, show=args.show, show_dir=args.show_dir)[0]
+ # show the results
+ result.pop('pred_scores') # pred_scores is too verbose for a demo.
+ print_json(dump(result, file_format='json', indent=4))
+
+
+if __name__ == '__main__':
+ main()
diff --git a/demo/ipu_train_example.sh b/demo/ipu_train_example.sh
new file mode 100644
index 0000000000000000000000000000000000000000..94c8456d97897a717166d83fb4a494a8a61bfceb
--- /dev/null
+++ b/demo/ipu_train_example.sh
@@ -0,0 +1,9 @@
+
+
+# Get SOTA accuracy 81.2 for 224-input ViT fine-tuning; reference:
+# https://github.com/google-research/vision_transformer#available-vit-models
+# cfg: vit-base-p16_ft-4xb544_in1k-224_ipu trains the model in fp16 precision
+# 8 epochs, batch size 2176, 16 IPUs, 4 replicas, model throughput about 5600 images, training time roughly 0.6 hours
+cfg_name=vit-base-p16_ft-4xb544_in1k-224_ipu
+python3 tools/train.py configs/vision_transformer/${cfg_name}.py --ipu-replicas 4 --no-validate &&
+python3 tools/test.py configs/vision_transformer/${cfg_name}.py work_dirs/${cfg_name}/latest.pth --metrics accuracy --device ipu
diff --git a/docker/Dockerfile b/docker/Dockerfile
index f81a9f52839c74c109b7541d5d50e264f1ec8838..5f7df525ceb364d1d0dff72520bf9f75bf05f791 100644
--- a/docker/Dockerfile
+++ b/docker/Dockerfile
@@ -1,5 +1,26 @@
-FROM image.sourcefind.cn:5000/dcu/admin/base/pytorch:1.10.0-centos7.6-dtk-22.10.1-py37-latest
-ENV DEBIAN_FRONTEND=noninteractive
-# 安装pip相关依赖
-COPY requirements.txt requirements.txt
-RUN pip3 install -i http://mirrors.aliyun.com/pypi/simple/ --trusted-host mirrors.aliyun.com -r requirements.txt
+ARG PYTORCH="1.12.1"
+ARG CUDA="11.3"
+ARG CUDNN="8"
+
+FROM pytorch/pytorch:${PYTORCH}-cuda${CUDA}-cudnn${CUDNN}-devel
+
+# fetch the keys, refer to https://forums.developer.nvidia.com/t/18-04-cuda-docker-image-is-broken/212892/9
+RUN apt-key adv --fetch-keys https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/3bf863cc.pub
+RUN apt-key adv --fetch-keys https://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1804/x86_64/7fa2af80.pub
+
+ENV TORCH_CUDA_ARCH_LIST="6.0 6.1 7.0+PTX"
+ENV TORCH_NVCC_FLAGS="-Xfatbin -compress-all"
+ENV CMAKE_PREFIX_PATH="$(dirname $(which conda))/../"
+
+RUN apt-get update && apt-get install -y ffmpeg git ninja-build libglib2.0-0 libsm6 libxrender-dev libxext6 \
+ && apt-get clean \
+ && rm -rf /var/lib/apt/lists/*
+
+# Install MIM
+RUN pip install openmim
+
+# Install MMPretrain
+RUN conda clean --all
+RUN git clone https://github.com/open-mmlab/mmpretrain.git
+WORKDIR ./mmpretrain
+RUN mim install --no-cache-dir -e .
diff --git a/docker/requirements.txt b/docker/requirements.txt
deleted file mode 100644
index 88a2e62c0a93da425e87407f3cb98b93a0ba7216..0000000000000000000000000000000000000000
--- a/docker/requirements.txt
+++ /dev/null
@@ -1,15 +0,0 @@
-albumentations>=0.3.2 --no-binary qudida,albumentations
-colorama
-requests
-rich
-scipy
-matplotlib>=3.1.0
-numpy
-packaging
-codecov
-flake8
-interrogate
-isort==4.3.21
-pytest
-xdoctest >= 0.10.0
-yapf
diff --git a/docker/serve/Dockerfile b/docker/serve/Dockerfile
new file mode 100644
index 0000000000000000000000000000000000000000..c50c4e8ee829eace217e0991d10002ad4e4589da
--- /dev/null
+++ b/docker/serve/Dockerfile
@@ -0,0 +1,37 @@
+ARG PYTORCH="2.0.1"
+ARG CUDA="11.7"
+ARG CUDNN="8"
+FROM pytorch/torchserve:latest-gpu
+
+ARG MMPRE="1.2.0"
+
+ENV PYTHONUNBUFFERED TRUE
+
+ENV HOME="/home/model-server"
+ENV PATH="/opt/conda/bin:$HOME/.local/bin:$PATH"
+RUN export FORCE_CUDA=1
+
+# TORCHSERVE
+RUN pip install torchserve torch-model-archiver
+RUN pip install nvgpu
+
+# OPEN-MMLAB
+ARG PYTORCH
+ARG CUDA
+RUN pip install openmim
+RUN mim install mmpretrain==${MMPRE}
+RUN mkdir -p $HOME/tmp
+
+COPY --chown=model-server entrypoint.sh $HOME/.local/bin/entrypoint.sh
+
+RUN chmod +x $HOME/.local/bin/entrypoint.sh
+
+COPY --chown=model-server config.properties $HOME/config.properties
+
+EXPOSE 8080 8081 8082
+
+USER model-server
+WORKDIR $HOME
+ENV TEMP=$HOME/tmp
+ENTRYPOINT ["/home/model-server/.local/bin/entrypoint.sh"]
+CMD ["serve"]
diff --git a/docker/serve/config.properties b/docker/serve/config.properties
new file mode 100644
index 0000000000000000000000000000000000000000..efb9c47e40ab550bac765611e6c6c6f2a7152f11
--- /dev/null
+++ b/docker/serve/config.properties
@@ -0,0 +1,5 @@
+inference_address=http://0.0.0.0:8080
+management_address=http://0.0.0.0:8081
+metrics_address=http://0.0.0.0:8082
+model_store=/home/model-server/model-store
+load_models=all
diff --git a/docker/serve/entrypoint.sh b/docker/serve/entrypoint.sh
new file mode 100644
index 0000000000000000000000000000000000000000..41ba00b048aed84b45c5a8015a016ff148e97d86
--- /dev/null
+++ b/docker/serve/entrypoint.sh
@@ -0,0 +1,12 @@
+#!/bin/bash
+set -e
+
+if [[ "$1" = "serve" ]]; then
+ shift 1
+ torchserve --start --ts-config /home/model-server/config.properties
+else
+ eval "$@"
+fi
+
+# prevent docker exit
+tail -f /dev/null
diff --git a/docs/en/Makefile b/docs/en/Makefile
new file mode 100644
index 0000000000000000000000000000000000000000..d4bb2cbb9eddb1bb1b4f366623044af8e4830919
--- /dev/null
+++ b/docs/en/Makefile
@@ -0,0 +1,20 @@
+# Minimal makefile for Sphinx documentation
+#
+
+# You can set these variables from the command line, and also
+# from the environment for the first two.
+SPHINXOPTS ?=
+SPHINXBUILD ?= sphinx-build
+SOURCEDIR = .
+BUILDDIR = _build
+
+# Put it first so that "make" without argument is like "make help".
+help:
+ @$(SPHINXBUILD) -M help "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)
+
+.PHONY: help Makefile
+
+# Catch-all target: route all unknown targets to Sphinx using the new
+# "make mode" option. $(O) is meant as a shortcut for $(SPHINXOPTS).
+%: Makefile
+ @$(SPHINXBUILD) -M $@ "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)
diff --git a/docs/en/_static/css/readthedocs.css b/docs/en/_static/css/readthedocs.css
new file mode 100644
index 0000000000000000000000000000000000000000..4c7fa98fa8d80fbff62c508002aa5f65520195e9
--- /dev/null
+++ b/docs/en/_static/css/readthedocs.css
@@ -0,0 +1,62 @@
+.header-logo {
+ background-image: url("../image/mmpt-logo.png");
+ background-size: 183px 50px;
+ height: 50px;
+ width: 183px;
+}
+
+@media screen and (min-width: 1100px) {
+ .header-logo {
+ top: -12px;
+ }
+}
+
+pre {
+ white-space: pre;
+}
+
+@media screen and (min-width: 2000px) {
+ .pytorch-content-left {
+ width: 1200px;
+ margin-left: 30px;
+ }
+ article.pytorch-article {
+ max-width: 1200px;
+ }
+ .pytorch-breadcrumbs-wrapper {
+ width: 1200px;
+ }
+ .pytorch-right-menu.scrolling-fixed {
+ position: fixed;
+ top: 45px;
+ left: 1580px;
+ }
+}
+
+
+article.pytorch-article section code {
+ padding: .2em .4em;
+ background-color: #f3f4f7;
+ border-radius: 5px;
+}
+
+/* Disable the change in tables */
+article.pytorch-article section table code {
+ padding: unset;
+ background-color: unset;
+ border-radius: unset;
+}
+
+table.autosummary td {
+ width: 50%
+}
+
+img.align-center {
+ display: block;
+ margin-left: auto;
+ margin-right: auto;
+}
+
+article.pytorch-article p.rubric {
+ font-weight: bold;
+}
diff --git a/docs/en/_static/image/confusion-matrix.png b/docs/en/_static/image/confusion-matrix.png
new file mode 100755
index 0000000000000000000000000000000000000000..a1dc7ba6a73700ff55f81e40d00bc16f4da26b31
Binary files /dev/null and b/docs/en/_static/image/confusion-matrix.png differ
diff --git a/docs/en/_static/image/mmpt-logo.png b/docs/en/_static/image/mmpt-logo.png
new file mode 100644
index 0000000000000000000000000000000000000000..f4e060716520ece5db7e85df3c3ad8fd9e0eda57
Binary files /dev/null and b/docs/en/_static/image/mmpt-logo.png differ
diff --git a/docs/en/_static/image/tools/analysis/analyze_log.jpg b/docs/en/_static/image/tools/analysis/analyze_log.jpg
new file mode 100644
index 0000000000000000000000000000000000000000..8eb1a27d6464d255b84b23a7460a5f622f51712f
Binary files /dev/null and b/docs/en/_static/image/tools/analysis/analyze_log.jpg differ
diff --git a/docs/en/_static/js/custom.js b/docs/en/_static/js/custom.js
new file mode 100644
index 0000000000000000000000000000000000000000..3eec9f46f8d3d360c0dcc256ddddd65b456e9553
--- /dev/null
+++ b/docs/en/_static/js/custom.js
@@ -0,0 +1,10 @@
+var collapsedSections = ['Advanced Guides', 'Model Zoo', 'Visualization', 'Analysis Tools', 'Deployment', 'Notes'];
+
+$(document).ready(function () {
+ $('.model-summary').DataTable({
+ "stateSave": false,
+ "lengthChange": false,
+ "pageLength": 20,
+ "order": []
+ });
+});
diff --git a/docs/en/_templates/404.html b/docs/en/_templates/404.html
new file mode 100644
index 0000000000000000000000000000000000000000..639d255989a87263c1d8a07df2312e1882104e90
--- /dev/null
+++ b/docs/en/_templates/404.html
@@ -0,0 +1,18 @@
+{% extends "layout.html" %}
+
+{% block body %}
+
+<h1>Page Not Found</h1>
+
+<p>
+  The page you are looking for cannot be found.
+</p>
+
+<p>
+  If you just switched documentation versions, it is likely that the page you were on has moved. You can look for it in
+  the content table on the left, or go to the homepage.
+</p>
+
+<p>
+  If you cannot find the documentation you want, please open an issue to tell us!
+</p>
+
+{% endblock %}
diff --git a/docs/en/_templates/autosummary/class.rst b/docs/en/_templates/autosummary/class.rst
new file mode 100644
index 0000000000000000000000000000000000000000..4c3a7a9abf5c5b14ac3ef3b00a2f070480295358
--- /dev/null
+++ b/docs/en/_templates/autosummary/class.rst
@@ -0,0 +1,13 @@
+.. role:: hidden
+ :class: hidden-section
+.. currentmodule:: {{ module }}
+
+
+{{ name | underline}}
+
+.. autoclass:: {{ name }}
+ :members:
+
+..
+ autogenerated from _templates/autosummary/class.rst
+ note it does not have :inherited-members:
diff --git a/docs/en/_templates/callable.rst b/docs/en/_templates/callable.rst
new file mode 100644
index 0000000000000000000000000000000000000000..3a7b9d2b96c76dfa3eb1d8bef56f58f219fe7760
--- /dev/null
+++ b/docs/en/_templates/callable.rst
@@ -0,0 +1,14 @@
+.. role:: hidden
+ :class: hidden-section
+.. currentmodule:: {{ module }}
+
+
+{{ name | underline}}
+
+.. autoclass:: {{ name }}
+ :members:
+ :special-members: __call__
+
+..
+ autogenerated from _templates/callable.rst
+ note it does not have :inherited-members:
diff --git a/docs/en/_templates/data_transform.rst b/docs/en/_templates/data_transform.rst
new file mode 100644
index 0000000000000000000000000000000000000000..376bfe9db6c305e681f265dd0e20b7b7ea6e500f
--- /dev/null
+++ b/docs/en/_templates/data_transform.rst
@@ -0,0 +1,13 @@
+.. role:: hidden
+ :class: hidden-section
+.. currentmodule:: {{ module }}
+
+
+{{ name | underline}}
+
+.. autoclass:: {{ name }}
+ :members: transform
+
+..
+ autogenerated from _templates/callable.rst
+ note it does not have :inherited-members:
diff --git a/docs/en/advanced_guides/convention.md b/docs/en/advanced_guides/convention.md
new file mode 100644
index 0000000000000000000000000000000000000000..9edd04c1d5685aaa353e10d04e7a609d9fc9adf4
--- /dev/null
+++ b/docs/en/advanced_guides/convention.md
@@ -0,0 +1,120 @@
+# Convention in MMPretrain
+
+## Model Naming Convention
+
+We follow the convention below to name models, and contributors are advised to follow the same style. A model name is divided into five parts: algorithm info, module info, pretrain info, training info and data info. Different parts are concatenated by underscores `'_'`, and words within the same part are concatenated by dashes `'-'`.
+
+```text
+{algorithm info}_{module info}_{pretrain info}_{training info}_{data info}
+```
+
+- `algorithm info` (optional): The main algorithm information; it includes the main training algorithm, like MAE, BEiT, etc.
+- `module info`: The module information; it usually includes the backbone name, such as resnet, vit, etc.
+- `pretrain info` (optional): The pre-trained model information, e.g., the pre-trained model is trained on ImageNet-21k.
+- `training info`: The training information, i.e., the training schedule, including batch size, lr schedule, data augmentation and the like.
+- `data info`: The data information; it usually includes the dataset name, input size and so on, such as imagenet, cifar, etc.
+
+### Algorithm information
+
+The main algorithm name to train the model. For example:
+
+- `simclr`
+- `mocov2`
+- `eva-mae-style`
+
+Models trained by supervised image classification can omit this field.
+
+### Module information
+
+The modules of the model, usually, the backbone must be included in this field, and the neck and head
+information can be omitted. For example:
+
+- `resnet50`
+- `vit-base-p16`
+- `swin-base`
+
+### Pretrain information
+
+If the model is a fine-tuned model from a pre-trained model, we need to record some information of the
+pre-trained model. For example:
+
+- The source of the pre-trained model: `fb`, `openai`, etc.
+- The method to train the pre-trained model: `clip`, `mae`, `distill`, etc.
+- The dataset used for pre-training: `in21k`, `laion2b`, etc. (`in1k` can be omitted.)
+- The training duration: `300e`, `1600e`, etc.
+
+Not all information is necessary, only select the necessary information to distinguish different pre-trained
+models.
+
+At the end of this field, use a `-pre` as an identifier, like `mae-in21k-pre`.
+
+### Training information
+
+Training schedule, including the training type, `batch size`, `lr schedule`, data augmentation, special loss functions and so on:
+
+- format `{gpu x batch_per_gpu}`, such as `8xb32`
+
+Training type (mainly seen in transformer networks, such as the `ViT` algorithm, which is usually divided into two training types: pre-training and fine-tuning):
+
+- `ft` : configuration file for fine-tuning
+- `pt` : configuration file for pretraining
+
+Training recipe. Usually, only the part that is different from the original paper will be marked. These methods will be arranged in the order `{pipeline aug}-{train aug}-{loss trick}-{scheduler}-{epochs}`.
+
+- `coslr-200e` : use cosine scheduler to train 200 epochs
+- `autoaug-mixup-lbs-coslr-50e` : use `autoaug`, `mixup`, `label smooth`, `cosine scheduler` to train 50 epochs
+
+If the model is converted from a third-party repository, such as the official repository, the training
+information can be omitted and `3rdparty` is used as an identifier.
+
+### Data information
+
+- `in1k` : `ImageNet1k` dataset, defaults to an input image size of 224x224;
+- `in21k` : `ImageNet21k` dataset, also called `ImageNet22k` dataset, defaults to an input image size of 224x224;
+- `in1k-384px` : Indicates that the input image size is 384x384;
+- `cifar100`
+
+### Model Name Example
+
+```text
+vit-base-p32_clip-openai-pre_3rdparty_in1k
+```
+
+- `vit-base-p32`: The module information
+- `clip-openai-pre`: The pre-train information.
+ - `clip`: The pre-train method is clip.
+  - `openai`: The pre-trained model comes from OpenAI.
+ - `pre`: The pre-train identifier.
+- `3rdparty`: The model is converted from a third-party repository.
+- `in1k`: Dataset information. The model is trained from ImageNet-1k dataset and the input size is `224x224`.
+
+```text
+beit_beit-base-p16_8xb256-amp-coslr-300e_in1k
+```
+
+- `beit`: The algorithm information
+- `beit-base`: The module information, since the backbone is a modified ViT from BEiT, the backbone name is
+ also `beit`.
+- `8xb256-amp-coslr-300e`: The training information.
+ - `8xb256`: Use 8 GPUs and the batch size on each GPU is 256.
+ - `amp`: Use automatic-mixed-precision training.
+ - `coslr`: Use cosine annealing learning rate scheduler.
+ - `300e`: To train 300 epochs.
+- `in1k`: Dataset information. The model is trained from ImageNet-1k dataset and the input size is `224x224`.
+
+## Config File Naming Convention
+
+The naming of a config file is almost the same as the model name, with several differences (see the example after the list):
+
+- The training information is necessary, and cannot be `3rdparty`.
+- If the config file only includes backbone settings, without head settings or dataset settings, we will name
+  it as `{module info}_headless.py`. This kind of config file is usually used for third-party pre-trained
+  models on large datasets.
+
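+For example, the config file corresponding to the second model name example above would be:
+
+```text
+beit_beit-base-p16_8xb256-amp-coslr-300e_in1k.py
+```
+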
+## Checkpoint Naming Convention
+
+The naming of the checkpoint file mainly includes the model name, date and hash value.
+
+```text
+{model_name}_{date}-{hash}.pth
+```
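+
+For example, an illustrative checkpoint file name following this convention:
+
+```text
+resnet50_8xb32_in1k_20210831-ea4938fc.pth
+```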
diff --git a/docs/en/advanced_guides/datasets.md b/docs/en/advanced_guides/datasets.md
new file mode 100644
index 0000000000000000000000000000000000000000..1a018e441a1a1e820b02602dec0f85f553ec8eb0
--- /dev/null
+++ b/docs/en/advanced_guides/datasets.md
@@ -0,0 +1,72 @@
+# Adding New Dataset
+
+You can write a new dataset class that inherits from `BaseDataset`, and overwrite `load_data_list(self)`,
+like [CIFAR10](https://github.com/open-mmlab/mmpretrain/blob/main/mmpretrain/datasets/cifar.py) and [ImageNet](https://github.com/open-mmlab/mmpretrain/blob/main/mmpretrain/datasets/imagenet.py).
+Typically, this function returns a list, where each sample is a dict, containing necessary data information, e.g., `img` and `gt_label`.
+
+Assume we are going to implement a `Filelist` dataset, which takes filelists for both training and testing. The format of the annotation list is as follows:
+
+```text
+000001.jpg 0
+000002.jpg 1
+```
+
+## 1. Create Dataset Class
+
+We can create a new dataset in `mmpretrain/datasets/filelist.py` to load the data.
+
+```python
+import os.path as osp
+
+from mmpretrain.registry import DATASETS
+from .base_dataset import BaseDataset
+
+
+@DATASETS.register_module()
+class Filelist(BaseDataset):
+
+    def load_data_list(self):
+        assert isinstance(self.ann_file, str)
+
+        data_list = []
+        with open(self.ann_file) as f:
+            samples = [x.strip().split(' ') for x in f.readlines()]
+            for filename, gt_label in samples:
+                # join the image prefix (if any) with the file name
+                img_path = osp.join(self.img_prefix, filename)
+                info = {'img_path': img_path, 'gt_label': int(gt_label)}
+                data_list.append(info)
+        return data_list
+```
+
+## 2. Add to the package
+
+And add this dataset class in `mmpretrain/datasets/__init__.py`
+
+```python
+from .base_dataset import BaseDataset
+...
+from .filelist import Filelist
+
+__all__ = [
+ 'BaseDataset', ... ,'Filelist'
+]
+```
+
+## 3. Modify Related Config
+
+Then, to use `Filelist` in the config, you can modify it as follows:
+
+```python
+train_dataloader = dict(
+ ...
+ dataset=dict(
+ type='Filelist',
+ ann_file='image_list.txt',
+ pipeline=train_pipeline,
+ )
+)
+```
+
+All dataset classes that inherit from [`BaseDataset`](https://github.com/open-mmlab/mmpretrain/blob/main/mmpretrain/datasets/base_dataset.py) have **lazy loading** and **memory saving** features; you can refer to the related documents of {external+mmengine:doc}`BaseDataset `.
+
+```{note}
+If the dictionary of the data sample contains 'img_path' but not 'img', then the `LoadImageFromFile` transform must be added to the pipeline.
+```
diff --git a/docs/en/advanced_guides/evaluation.md b/docs/en/advanced_guides/evaluation.md
new file mode 100644
index 0000000000000000000000000000000000000000..d7978eafe02bfd09d1003bd5e1a6516a3b7020d6
--- /dev/null
+++ b/docs/en/advanced_guides/evaluation.md
@@ -0,0 +1,103 @@
+# Customize Evaluation Metrics
+
+## Use metrics in MMPretrain
+
+In MMPretrain, we have provided multiple metrics for both single-label classification and multi-label
+classification:
+
+**Single-label Classification**:
+
+- [`Accuracy`](mmpretrain.evaluation.Accuracy)
+- [`SingleLabelMetric`](mmpretrain.evaluation.SingleLabelMetric), including precision, recall, f1-score and
+ support.
+
+**Multi-label Classification**:
+
+- [`AveragePrecision`](mmpretrain.evaluation.AveragePrecision), or AP (mAP).
+- [`MultiLabelMetric`](mmpretrain.evaluation.MultiLabelMetric), including precision, recall, f1-score and
+ support.
+
+To use these metrics during validation and testing, we need to modify the `val_evaluator` and `test_evaluator`
+fields in the config file.
+
+Here are several examples:
+
+1. Calculate top-1 and top-5 accuracy during both validation and test.
+
+ ```python
+ val_evaluator = dict(type='Accuracy', topk=(1, 5))
+ test_evaluator = val_evaluator
+ ```
+
+2. Calculate top-1 accuracy, top-5 accuracy, precision and recall during both validation and test.
+
+ ```python
+ val_evaluator = [
+ dict(type='Accuracy', topk=(1, 5)),
+ dict(type='SingleLabelMetric', items=['precision', 'recall']),
+ ]
+ test_evaluator = val_evaluator
+ ```
+
+3. Calculate mAP (mean AveragePrecision), CP (Class-wise mean Precision), CR (Class-wise mean Recall), CF
+ (Class-wise mean F1-score), OP (Overall mean Precision), OR (Overall mean Recall) and OF1 (Overall mean
+ F1-score).
+
+ ```python
+ val_evaluator = [
+ dict(type='AveragePrecision'),
+ dict(type='MultiLabelMetric', average='macro'), # class-wise mean
+ dict(type='MultiLabelMetric', average='micro'), # overall mean
+ ]
+ test_evaluator = val_evaluator
+ ```
+
+## Add new metrics
+
+MMPretrain supports implementing customized evaluation metrics for users who need more flexibility.
+
+You need to create a new file under `mmpretrain/evaluation/metrics`, and implement the new metric in the file, for example, in `mmpretrain/evaluation/metrics/my_metric.py`. In this file, create a customized evaluation metric class `MyMetric` which inherits [`BaseMetric` in MMEngine](mmengine.evaluator.BaseMetric).
+
+The data format processing method `process` and the metric calculation method `compute_metrics` need to be overwritten. Register the class in the `METRICS` registry so it can be used like any built-in evaluation metric.
+
+```python
+from typing import Dict, List, Sequence
+
+from mmengine.evaluator import BaseMetric
+
+from mmpretrain.registry import METRICS
+
+
+@METRICS.register_module()
+class MyMetric(BaseMetric):
+
+    def process(self, data_batch: Sequence[Dict], data_samples: Sequence[Dict]):
+        """The processed results should be stored in ``self.results``, which will
+        be used to compute the metrics when all batches have been processed.
+        ``data_batch`` stores the batch data from the dataloader,
+        and ``data_samples`` stores the batch outputs from the model.
+        """
+        ...
+
+    def compute_metrics(self, results: List) -> Dict:
+        """Compute the metrics from the processed results and return the
+        evaluation results."""
+        ...
+```
+
+Then, import it in the `mmpretrain/evaluation/metrics/__init__.py` to add it into the `mmpretrain.evaluation` package.
+
+```python
+# In mmpretrain/evaluation/metrics/__init__.py
+...
+from .my_metric import MyMetric
+
+__all__ = [..., 'MyMetric']
+```
+
+Finally, use `MyMetric` in the `val_evaluator` and `test_evaluator` field of config files.
+
+```python
+val_evaluator = dict(type='MyMetric', ...)
+test_evaluator = val_evaluator
+```
+
+```{note}
+More details can be found in {external+mmengine:doc}`MMEngine Documentation: Evaluation `.
+```
diff --git a/docs/en/advanced_guides/modules.md b/docs/en/advanced_guides/modules.md
new file mode 100644
index 0000000000000000000000000000000000000000..fb34aedec2c7f2940504f307351f80305f1ee441
--- /dev/null
+++ b/docs/en/advanced_guides/modules.md
@@ -0,0 +1,511 @@
+# Customize Models
+
+In our design, a complete model is defined as a top-level module which contains several model components based on their functionalities.
+
+- model: a top-level module that defines the type of the task, such as `ImageClassifier` for image classification, `MAE` for self-supervised learning, `ImageToImageRetriever` for image retrieval.
+- backbone: usually a feature extraction network that records the major differences between models, e.g., `ResNet`, `MobileNet`.
+- neck: the component between backbone and head, e.g., `GlobalAveragePooling`.
+- head: the component for specific tasks, e.g., `ClsHead`, `ContrastiveHead`.
+- loss: the component in the head for calculating losses, e.g., `CrossEntropyLoss`, `LabelSmoothLoss`.
+- target_generator: the component specific to self-supervised learning tasks, e.g., `VQKD`, `HOGGenerator`.
+
+## Add a new model
+
+Generally, for image classification and retrieval tasks, the pipelines are consistent. However, the pipelines differ between self-supervised learning algorithms, like `MAE` and `BEiT`. Thus, in this section, we will explain how to add your own self-supervised learning algorithm.
+
+### Add a new self-supervised learning algorithm
+
+1. Create a new file `mmpretrain/models/selfsup/new_algorithm.py` and implement `NewAlgorithm` in it.
+
+ ```python
+ from mmpretrain.registry import MODELS
+ from .base import BaseSelfSupvisor
+
+
+ @MODELS.register_module()
+ class NewAlgorithm(BaseSelfSupvisor):
+
+ def __init__(self, backbone, neck=None, head=None, init_cfg=None):
+ super().__init__(init_cfg)
+ pass
+
+ # ``extract_feat`` function is defined in BaseSelfSupvisor, you could
+ # overwrite it if needed
+ def extract_feat(self, inputs, **kwargs):
+ pass
+
+ # the core function to compute the loss
+ def loss(self, inputs, data_samples, **kwargs):
+ pass
+
+ ```
+
+2. Import the new algorithm module in `mmpretrain/models/selfsup/__init__.py`
+
+ ```python
+ ...
+ from .new_algorithm import NewAlgorithm
+
+ __all__ = [
+ ...,
+ 'NewAlgorithm',
+ ...
+ ]
+ ```
+
+3. Use it in your config file.
+
+ ```python
+ model = dict(
+ type='NewAlgorithm',
+ backbone=...,
+ neck=...,
+ head=...,
+ ...
+ )
+ ```
+
+## Add a new backbone
+
+Here we present how to develop a new backbone component by an example of `ResNet_CIFAR`.
+As the input size of CIFAR is 32x32, which is much smaller than the default size of 224x224 in ImageNet, this backbone replaces the `kernel_size=7, stride=2` stem convolution with `kernel_size=3, stride=1` and removes the max pooling after the stem layer to avoid forwarding small feature maps to residual blocks.
+
+The easiest way is to inherit from `ResNet` and only modify the stem layer.
+
+1. Create a new file `mmpretrain/models/backbones/resnet_cifar.py`.
+
+ ```python
+   import torch.nn as nn
+   from mmcv.cnn import build_conv_layer, build_norm_layer
+
+   from mmpretrain.registry import MODELS
+   from .resnet import ResNet
+
+
+ @MODELS.register_module()
+ class ResNet_CIFAR(ResNet):
+
+ """ResNet backbone for CIFAR.
+
+ short description of the backbone
+
+ Args:
+ depth(int): Network depth, from {18, 34, 50, 101, 152}.
+ ...
+ """
+
+       def __init__(self, depth, deep_stem=False, **kwargs):
+ # call ResNet init
+ super(ResNet_CIFAR, self).__init__(depth, deep_stem=deep_stem, **kwargs)
+ # other specific initializations
+ assert not self.deep_stem, 'ResNet_CIFAR do not support deep_stem'
+
+ def _make_stem_layer(self, in_channels, base_channels):
+ # override the ResNet method to modify the network structure
+ self.conv1 = build_conv_layer(
+ self.conv_cfg,
+ in_channels,
+ base_channels,
+ kernel_size=3,
+ stride=1,
+ padding=1,
+ bias=False)
+ self.norm1_name, norm1 = build_norm_layer(
+ self.norm_cfg, base_channels, postfix=1)
+ self.add_module(self.norm1_name, norm1)
+ self.relu = nn.ReLU(inplace=True)
+
+ def forward(self, x):
+ # Customize the forward method if needed.
+ x = self.conv1(x)
+ x = self.norm1(x)
+ x = self.relu(x)
+ outs = []
+ for i, layer_name in enumerate(self.res_layers):
+ res_layer = getattr(self, layer_name)
+ x = res_layer(x)
+ if i in self.out_indices:
+ outs.append(x)
+ # The return value needs to be a tuple with multi-scale outputs from different depths.
+ # If you don't need multi-scale features, just wrap the output as a one-item tuple.
+ return tuple(outs)
+
+ def init_weights(self):
+ # Customize the weight initialization method if needed.
+ super().init_weights()
+
+ # Disable the weight initialization if loading a pretrained model.
+ if self.init_cfg is not None and self.init_cfg['type'] == 'Pretrained':
+ return
+
+ # Usually, we recommend using `init_cfg` to specify weight initialization methods
+ # of convolution, linear, or normalization layers. If you have some special needs,
+ # do these extra weight initialization here.
+ ...
+ ```
+
+```{note}
+In the OpenMMLab 2.0 design, the original `BACKBONES`, `NECKS`, `HEADS` and `LOSSES` registries are replaced by the unified `MODELS` registry.
+```
+
+2. Import the new backbone module in `mmpretrain/models/backbones/__init__.py`.
+
+ ```python
+ ...
+ from .resnet_cifar import ResNet_CIFAR
+
+ __all__ = [
+ ..., 'ResNet_CIFAR'
+ ]
+ ```
+
+3. Modify the correlated settings in your config file.
+
+ ```python
+ model = dict(
+ ...
+ backbone=dict(
+ type='ResNet_CIFAR',
+ depth=18,
+ ...),
+ ...
+ ```
+
+### Add a new backbone for self-supervised learning
+
+For some self-supervised learning algorithms, such as `MAE` and `BEiT`, the backbones are somewhat different: they need to deal with a `mask` in order to extract features only from the visible tokens.
+
+Take [MAEViT](mmpretrain.models.selfsup.MAEViT) as an example: we need to overwrite the `forward` function to compute with the `mask`. We also define `init_weights` to initialize parameters and `random_masking` to generate the mask for `MAE` pre-training.
+
+```python
+from typing import Optional, Tuple
+
+import torch
+
+from mmpretrain.models import VisionTransformer
+
+
+class MAEViT(VisionTransformer):
+    """Vision Transformer for MAE pre-training."""
+
+    def __init__(self, mask_ratio: float, **kwargs) -> None:
+ super().__init__(**kwargs)
+ # position embedding is not learnable during pretraining
+ self.pos_embed.requires_grad = False
+ self.mask_ratio = mask_ratio
+ self.num_patches = self.patch_resolution[0] * self.patch_resolution[1]
+
+    def init_weights(self) -> None:
+        """Initialize position embedding, patch embedding and cls token."""
+        super().init_weights()
+        # add extra initialization here if needed
+        pass
+
+ def random_masking(
+ self,
+ x: torch.Tensor,
+ mask_ratio: float = 0.75
+ ) -> Tuple[torch.Tensor, torch.Tensor, torch.Tensor]:
+ """Generate the mask for MAE Pre-training."""
+ pass
+
+ def forward(
+ self,
+ x: torch.Tensor,
+ mask: Optional[bool] = True
+ ) -> Tuple[torch.Tensor, torch.Tensor, torch.Tensor]:
+ """Generate features for masked images.
+
+        The function supports two kinds of forward behaviors. If the ``mask`` is
+        ``True``, the function will randomly mask some patches and get the
+        hidden features of the visible patches, which means the function will
+        be executed as masked image modeling pre-training;
+        if the ``mask`` is ``None`` or ``False``, the forward function will
+        call ``super().forward()``, which extracts features from images without
+        a mask.
+        """
+        if mask is None or mask is False:
+            return super().forward(x)
+
+ else:
+ B = x.shape[0]
+ x = self.patch_embed(x)[0]
+ # add pos embed w/o cls token
+ x = x + self.pos_embed[:, 1:, :]
+
+ # masking: length -> length * mask_ratio
+ x, mask, ids_restore = self.random_masking(x, self.mask_ratio)
+
+ # append cls token
+ cls_token = self.cls_token + self.pos_embed[:, :1, :]
+ cls_tokens = cls_token.expand(B, -1, -1)
+ x = torch.cat((cls_tokens, x), dim=1)
+
+ for _, layer in enumerate(self.layers):
+ x = layer(x)
+ # Use final norm
+ x = self.norm1(x)
+
+ return (x, mask, ids_restore)
+
+```
+
+## Add a new neck
+
+Here we take `GlobalAveragePooling` as an example. It is a very simple neck without any arguments.
+To add a new neck, we mainly implement the `forward` function, which applies some operations on the output from the backbone and forwards the results to the head.
+
+1. Create a new file in `mmpretrain/models/necks/gap.py`.
+
+ ```python
+ import torch.nn as nn
+
+ from mmpretrain.registry import MODELS
+
+ @MODELS.register_module()
+ class GlobalAveragePooling(nn.Module):
+
+       def __init__(self):
+           super().__init__()
+           self.gap = nn.AdaptiveAvgPool2d((1, 1))
+
+ def forward(self, inputs):
+ # we regard inputs as tensor for simplicity
+ outs = self.gap(inputs)
+ outs = outs.view(inputs.size(0), -1)
+ return outs
+ ```
+
+2. Import the new neck module in `mmpretrain/models/necks/__init__.py`.
+
+ ```python
+ ...
+ from .gap import GlobalAveragePooling
+
+ __all__ = [
+ ..., 'GlobalAveragePooling'
+ ]
+ ```
+
+3. Modify the correlated settings in your config file.
+
+ ```python
+ model = dict(
+ neck=dict(type='GlobalAveragePooling'),
+ )
+ ```
+
+## Add a new head
+
+### Based on ClsHead
+
+Here we present how to develop a new head with a simplified `VisionTransformerClsHead` as the example.
+To implement a new head, we need to implement a `pre_logits` method for the processing before the final classification layer, and a `forward` method.
+
+:::{admonition} Why do we need the `pre_logits` method?
+:class: note
+
+In classification tasks, we usually use a linear layer to do the final classification. And sometimes, we need
+to obtain the feature before the final classification, which is the output of the `pre_logits` method.
+:::
+
+1. Create a new file in `mmpretrain/models/heads/vit_head.py`.
+
+ ```python
+ import torch.nn as nn
+
+ from mmpretrain.registry import MODELS
+ from .cls_head import ClsHead
+
+
+ @MODELS.register_module()
+ class VisionTransformerClsHead(ClsHead):
+
+ def __init__(self, num_classes, in_channels, hidden_dim, **kwargs):
+ super().__init__(**kwargs)
+ self.in_channels = in_channels
+ self.num_classes = num_classes
+ self.hidden_dim = hidden_dim
+
+ self.fc1 = nn.Linear(in_channels, hidden_dim)
+ self.act = nn.Tanh()
+ self.fc2 = nn.Linear(hidden_dim, num_classes)
+
+ def pre_logits(self, feats):
+ # The output of the backbone is usually a tuple from multiple depths,
+ # and for classification, we only need the final output.
+ feat = feats[-1]
+
+ # The final output of VisionTransformer is a tuple of patch tokens and
+ # classification tokens. We need classification tokens here.
+ _, cls_token = feat
+
+ # Do all works except the final classification linear layer.
+ return self.act(self.fc1(cls_token))
+
+ def forward(self, feats):
+ pre_logits = self.pre_logits(feats)
+
+ # The final classification linear layer.
+ cls_score = self.fc2(pre_logits)
+ return cls_score
+ ```
+
+2. Import the module in `mmpretrain/models/heads/__init__.py`.
+
+ ```python
+ ...
+ from .vit_head import VisionTransformerClsHead
+
+ __all__ = [
+ ..., 'VisionTransformerClsHead'
+ ]
+ ```
+
+3. Modify the correlated settings in your config file.
+
+ ```python
+ model = dict(
+ head=dict(
+ type='VisionTransformerClsHead',
+ ...,
+ ))
+ ```
+
+### Based on BaseModule
+
+Here is an example of `MAEPretrainHead`, which is based on `BaseModule` and implemented for the masked image modeling task. It is required to implement the `loss` function to compute the loss, while the other helper functions are optional.
+
+```python
+# Copyright (c) OpenMMLab. All rights reserved.
+import torch
+from mmengine.model import BaseModule
+
+from mmpretrain.registry import MODELS
+
+
+@MODELS.register_module()
+class MAEPretrainHead(BaseModule):
+ """Head for MAE Pre-training."""
+
+ def __init__(self,
+ loss: dict,
+ norm_pix: bool = False,
+ patch_size: int = 16) -> None:
+ super().__init__()
+ self.norm_pix = norm_pix
+ self.patch_size = patch_size
+ self.loss_module = MODELS.build(loss)
+
+ def patchify(self, imgs: torch.Tensor) -> torch.Tensor:
+ """Split images into non-overlapped patches."""
+ p = self.patch_size
+ assert imgs.shape[2] == imgs.shape[3] and imgs.shape[2] % p == 0
+
+ h = w = imgs.shape[2] // p
+ x = imgs.reshape(shape=(imgs.shape[0], 3, h, p, w, p))
+ x = torch.einsum('nchpwq->nhwpqc', x)
+ x = x.reshape(shape=(imgs.shape[0], h * w, p**2 * 3))
+ return x
+
+ def construct_target(self, target: torch.Tensor) -> torch.Tensor:
+ """Construct the reconstruction target."""
+ target = self.patchify(target)
+ if self.norm_pix:
+ # normalize the target image
+ mean = target.mean(dim=-1, keepdim=True)
+ var = target.var(dim=-1, keepdim=True)
+ target = (target - mean) / (var + 1.e-6)**.5
+
+ return target
+
+ def loss(self, pred: torch.Tensor, target: torch.Tensor,
+ mask: torch.Tensor) -> torch.Tensor:
+ """Generate loss."""
+ target = self.construct_target(target)
+ loss = self.loss_module(pred, target, mask)
+
+ return loss
+```
+
+After the implementation, the remaining steps are the same as step 2 and step 3 in [Based on ClsHead](#based-on-clshead).
+
+## Add a new loss
+
+To add a new loss function, we mainly implement the `forward` function in the loss module. We should also register the loss module in `MODELS`.
+In addition, it is helpful to leverage the decorator `weighted_loss` to weight the loss for each element.
+Assuming that we want to mimic a probability distribution generated by another classification model, we implement an `L1Loss` to fulfill the purpose as below.
+
+1. Create a new file in `mmpretrain/models/losses/l1_loss.py`.
+
+ ```python
+ import torch
+ import torch.nn as nn
+
+ from mmpretrain.registry import MODELS
+ from .utils import weighted_loss
+
+ @weighted_loss
+ def l1_loss(pred, target):
+ assert pred.size() == target.size() and target.numel() > 0
+ loss = torch.abs(pred - target)
+ return loss
+
+ @MODELS.register_module()
+ class L1Loss(nn.Module):
+
+ def __init__(self, reduction='mean', loss_weight=1.0):
+ super(L1Loss, self).__init__()
+ self.reduction = reduction
+ self.loss_weight = loss_weight
+
+ def forward(self,
+ pred,
+ target,
+ weight=None,
+ avg_factor=None,
+ reduction_override=None):
+ assert reduction_override in (None, 'none', 'mean', 'sum')
+ reduction = (
+ reduction_override if reduction_override else self.reduction)
+ loss = self.loss_weight * l1_loss(
+ pred, target, weight, reduction=reduction, avg_factor=avg_factor)
+ return loss
+ ```
+
+2. Import the module in `mmpretrain/models/losses/__init__.py`.
+
+ ```python
+ ...
+ from .l1_loss import L1Loss
+
+ __all__ = [
+ ..., 'L1Loss'
+ ]
+ ```
+
+3. Modify loss field in the head configs.
+
+ ```python
+ model = dict(
+ head=dict(
+ loss=dict(type='L1Loss', loss_weight=1.0),
+ ))
+ ```
+
+Finally, we can combine all the new model components in a config file to create a new model. Because `ResNet_CIFAR` is not a ViT-based backbone, we do not use `VisionTransformerClsHead` here.
+
+```python
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='ResNet_CIFAR',
+ depth=18,
+ num_stages=4,
+ out_indices=(3, ),
+ style='pytorch'),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=10,
+ in_channels=512,
+ loss=dict(type='L1Loss', loss_weight=1.0),
+ topk=(1, 5),
+ ))
+
+```
+
+```{tip}
+For convenience, the same model components can be inherited from existing config files; refer to [Learn about configs](../user_guides/config.md) for more details.
+```
diff --git a/docs/en/advanced_guides/pipeline.md b/docs/en/advanced_guides/pipeline.md
new file mode 100644
index 0000000000000000000000000000000000000000..058e8139c91b331762cee7090d0626004e645930
--- /dev/null
+++ b/docs/en/advanced_guides/pipeline.md
@@ -0,0 +1,170 @@
+# Customize Data Pipeline
+
+## Design of Data pipelines
+
+In the [new dataset tutorial](./datasets.md), we learned that the dataset class uses the `load_data_list` method
+to initialize the entire dataset, and the information of every sample is saved in a dict.
+
+Usually, to save memory, we only load image paths and labels in `load_data_list`, and load the full
+image content when we use it. Moreover, we may want to apply some random data augmentation when picking
+samples during training. Almost all data loading, pre-processing, and formatting operations can be configured in
+MMPretrain by the **data pipeline**.
+
+The data pipeline defines how to process the sample dict when indexing a sample from the dataset. It consists
+of a sequence of data transforms. Each data transform takes a dict as input, processes it, and outputs a
+dict for the next data transform.
+
+Here is a data pipeline example for ResNet-50 training on ImageNet.
+
+```python
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='RandomResizedCrop', scale=224),
+ dict(type='RandomFlip', prob=0.5, direction='horizontal'),
+ dict(type='PackInputs'),
+]
+```
+
+All available data transforms in MMPretrain can be found in the [data transforms docs](mmpretrain.datasets.transforms).
+
+## Modify the training/test pipeline
+
+The data pipeline in MMPretrain is pretty flexible. You can control almost every step of the data
+preprocessing from the config file, but on the other hand, you may be confused facing so many options.
+
+Here is a common practice and guidance for image classification tasks.
+
+### Loading
+
+At the beginning of a data pipeline, we usually need to load image data from the file path.
+[`LoadImageFromFile`](mmcv.transforms.LoadImageFromFile) is commonly used to do this task.
+
+```python
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ ...
+]
+```
+
+If you want to load data from files with special formats or special locations, you can [implement a new loading
+transform](#add-new-data-transforms) and add it at the beginning of the data pipeline.
+
+### Augmentation and other processing
+
+During training, we usually need to do data augmentation to avoid overfitting. During the test, we also need to do
+some data processing like resizing and cropping. These data transforms will be placed after the loading process.
+
+Here is a simple data augmentation recipe example. It randomly resizes and crops the input image to the
+specified scale, and randomly flips the image horizontally with a probability of 0.5.
+
+```python
+train_pipeline = [
+ ...
+ dict(type='RandomResizedCrop', scale=224),
+ dict(type='RandomFlip', prob=0.5, direction='horizontal'),
+ ...
+]
+```
+
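+During the test, a typical pipeline replaces the random transforms with deterministic resizing and cropping. Here is a minimal sketch of a common ResNet-style test recipe (the exact sizes are illustrative):
+
+```python
+test_pipeline = [
+    dict(type='LoadImageFromFile'),
+    dict(type='ResizeEdge', scale=256, edge='short'),
+    dict(type='CenterCrop', crop_size=224),
+    dict(type='PackInputs'),
+]
+```
+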
+Here is a heavy data augmentation recipe example used in [Swin-Transformer](../papers/swin_transformer.md)
+training. To align with the official implementation, it specifies `pillow` as the resize backend and `bicubic`
+as the resize algorithm. Moreover, it adds [`RandAugment`](mmpretrain.datasets.transforms.RandAugment) and
+[`RandomErasing`](mmpretrain.datasets.transforms.RandomErasing) as extra data augmentation methods.
+
+This configuration specifies every detail of the data augmentation, and you can simply copy it to your own
+config file to apply the Swin-Transformer data augmentations.
+
+```python
+bgr_mean = [103.53, 116.28, 123.675]
+bgr_std = [57.375, 57.12, 58.395]
+
+train_pipeline = [
+ ...
+ dict(type='RandomResizedCrop', scale=224, backend='pillow', interpolation='bicubic'),
+ dict(type='RandomFlip', prob=0.5, direction='horizontal'),
+ dict(
+ type='RandAugment',
+ policies='timm_increasing',
+ num_policies=2,
+ total_level=10,
+ magnitude_level=9,
+ magnitude_std=0.5,
+ hparams=dict(
+ pad_val=[round(x) for x in bgr_mean], interpolation='bicubic')),
+ dict(
+ type='RandomErasing',
+ erase_prob=0.25,
+ mode='rand',
+ min_area_ratio=0.02,
+ max_area_ratio=1 / 3,
+ fill_color=bgr_mean,
+ fill_std=bgr_std),
+ ...
+]
+```
+
+```{note}
+Usually, the data augmentation part in the data pipeline handles only image-wise transforms, not transforms
+like image normalization or mixup/cutmix. This is because image normalization and mixup/cutmix can be applied to
+batched data to accelerate processing. To configure image normalization and mixup/cutmix, please use the [data preprocessor](mmpretrain.models.utils.data_preprocessor).
+```
+
+### Formatting
+
+The formatting step collects training data from the data information dict and converts these data into a
+model-friendly format.
+
+In most cases, you can simply use [`PackInputs`](mmpretrain.datasets.transforms.PackInputs), and it will
+convert the image from NumPy array format to a PyTorch tensor, and pack the ground-truth category information and
+other meta information into a [`DataSample`](mmpretrain.structures.DataSample).
+
+```python
+train_pipeline = [
+ ...
+ dict(type='PackInputs'),
+]
+```
+
+## Add new data transforms
+
+1. Write a new data transform in any file, e.g., `my_transform.py`, and place it in
+ the folder `mmpretrain/datasets/transforms/`. The data transform class needs to inherit
+ the [`mmcv.transforms.BaseTransform`](mmcv.transforms.BaseTransform) class and override
+ the `transform` method which takes a dict as input and returns a dict.
+
+ ```python
+ from mmcv.transforms import BaseTransform
+ from mmpretrain.registry import TRANSFORMS
+
+ @TRANSFORMS.register_module()
+ class MyTransform(BaseTransform):
+
+ def transform(self, results):
+ # Modify the data information dict `results`.
+ return results
+ ```
+
+2. Import the new class in the `mmpretrain/datasets/transforms/__init__.py`.
+
+ ```python
+ ...
+ from .my_transform import MyTransform
+
+ __all__ = [
+ ..., 'MyTransform'
+ ]
+ ```
+
+3. Use it in config files.
+
+ ```python
+ train_pipeline = [
+ ...
+ dict(type='MyTransform'),
+ ...
+ ]
+ ```
+
+## Pipeline visualization
+
+After designing data pipelines, you can use the [visualization tools](../useful_tools/dataset_visualization.md) to check the transformed images.
diff --git a/docs/en/advanced_guides/runtime.md b/docs/en/advanced_guides/runtime.md
new file mode 100644
index 0000000000000000000000000000000000000000..8150fb1432eaeb54553da93b943978eb953925fe
--- /dev/null
+++ b/docs/en/advanced_guides/runtime.md
@@ -0,0 +1,221 @@
+# Customize Runtime Settings
+
+The runtime configurations include many helpful functionalities, like checkpoint saving, logger configuration,
+etc. In this tutorial, we will introduce how to configure these functionalities.
+
+## Save Checkpoint
+
+The checkpoint saving functionality is a default hook during training. And you can configure it in the
+`default_hooks.checkpoint` field.
+
+```{note}
+The hook mechanism is widely used in all OpenMMLab libraries. Through hooks, you can plug in many
+functionalities without modifying the main execution logic of the runner.
+
+A detailed introduction of hooks can be found in {external+mmengine:doc}`Hooks `.
+```
+
+**The default settings**
+
+```python
+default_hooks = dict(
+ ...
+ checkpoint = dict(type='CheckpointHook', interval=1)
+ ...
+)
+```
+
+Here are some commonly used arguments; all available arguments can be found in [CheckpointHook](mmengine.hooks.CheckpointHook).
+
+- **`interval`** (int): The saving period. If set to -1, it will never save checkpoints.
+- **`by_epoch`** (bool): Whether the **`interval`** is by epoch or by iteration. Defaults to `True`.
+- **`out_dir`** (str): The root directory to save checkpoints. If not specified, the checkpoints will be saved in the work directory. If specified, the checkpoints will be saved in the sub-folder of the **`out_dir`**.
+- **`max_keep_ckpts`** (int): The maximum checkpoints to keep. In some cases, we want only the latest few checkpoints and would like to delete old ones to save disk space. Defaults to -1, which means unlimited.
+- **`save_best`** (str, List[str]): If specified, it will save the checkpoint with the best evaluation result.
+ Usually, you can simply use `save_best="auto"` to automatically select the evaluation metric.
+
+If you want a more advanced configuration, please refer to the [CheckpointHook docs](tutorials/hook.md#checkpointhook).
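+
+For example, a sketch combining several of these options (the values are illustrative):
+
+```python
+default_hooks = dict(
+    checkpoint=dict(
+        type='CheckpointHook',
+        interval=1,           # save a checkpoint every epoch
+        max_keep_ckpts=3,     # keep only the latest 3 checkpoints
+        save_best='auto'),    # also keep the best checkpoint
+)
+```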
+
+## Load Checkpoint / Resume Training
+
+In config files, you can specify the loading and resuming functionality as below:
+
+```python
+# load from which checkpoint
+load_from = "Your checkpoint path"
+
+# whether to resume training from the loaded checkpoint
+resume = False
+```
+
+The `load_from` field can be either a local path or an HTTP path. You can resume training from the checkpoint by
+specifying `resume=True`.
+
+```{tip}
+You can also enable auto resuming from the latest checkpoint by specifying `load_from=None` and `resume=True`.
+Runner will find the latest checkpoint from the work directory automatically.
+```
+
+If you are training models by our `tools/train.py` script, you can also use `--resume` argument to resume
+training without modifying the config file manually.
+
+```bash
+# Automatically resume from the latest checkpoint.
+python tools/train.py configs/resnet/resnet50_8xb32_in1k.py --resume
+
+# Resume from the specified checkpoint.
+python tools/train.py configs/resnet/resnet50_8xb32_in1k.py --resume checkpoints/resnet.pth
+```
+
+## Randomness Configuration
+
+In the `randomness` field, we provide some options to make the experiment as reproducible as possible.
+
+By default, we don't specify a seed in the config file, and in every experiment, the program will generate a random seed.
+
+**Default settings:**
+
+```python
+randomness = dict(seed=None, deterministic=False)
+```
+
+To make the experiment more reproducible, you can specify a seed and set `deterministic=True`. The influence
+of the `deterministic` option can be found [here](https://pytorch.org/docs/stable/notes/randomness.html#cuda-convolution-benchmarking).
+
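+For example, a fully reproducible setting could look like this (the seed value is arbitrary):
+
+```python
+randomness = dict(seed=0, deterministic=True)
+```
+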
+## Log Configuration
+
+The log configuration relates to multiple fields.
+
+In the `log_level` field, you can specify the global logging level. See {external+python:ref}`Logging Levels` for a list of levels.
+
+```python
+log_level = 'INFO'
+```
+
+In the `default_hooks.logger` field, you can specify the logging interval during training and testing. And all
+available arguments can be found in the [LoggerHook docs](tutorials/hook.md#loggerhook).
+
+```python
+default_hooks = dict(
+ ...
+ # print log every 100 iterations.
+ logger=dict(type='LoggerHook', interval=100),
+ ...
+)
+```
+
+In the `log_processor` field, you can specify the log smooth method. Usually, we use a window with length of 10
+to smooth the log and output the mean value of all information. If you want to specify the smooth method of
+some information finely, see the {external+mmengine:doc}`LogProcessor docs `.
+
+```python
+# The default setting, which will smooth the values in training log by a 10-length window.
+log_processor = dict(window_size=10)
+```
+
+In the `visualizer` field, you can specify multiple backends to save the log information, such as TensorBoard
+and WandB. More details can be found in the [Visualizer section](#visualizer).
+
+## Custom Hooks
+
+Many of the above functionalities are implemented by hooks, and you can also plug in other custom hooks by modifying the
+`custom_hooks` field. Here are some hooks in MMEngine and MMPretrain that you can use directly, such as:
+
+- [EMAHook](mmpretrain.engine.hooks.EMAHook)
+- [SyncBuffersHook](mmengine.hooks.SyncBuffersHook)
+- [EmptyCacheHook](mmengine.hooks.EmptyCacheHook)
+- [ClassNumCheckHook](mmpretrain.engine.hooks.ClassNumCheckHook)
+- ......
+
+For example, EMA (Exponential Moving Average) is widely used in model training, and you can enable it as
+below:
+
+```python
+custom_hooks = [
+ dict(type='EMAHook', momentum=4e-5, priority='ABOVE_NORMAL'),
+]
+```
+
+## Visualize Validation
+
+The validation visualization functionality is a default hook during validation. And you can configure it in the
+`default_hooks.visualization` field.
+
+By default, it is disabled, and you can enable it by specifying `enable=True`. More arguments can be found in
+the [VisualizationHook docs](mmpretrain.engine.hooks.VisualizationHook).
+
+```python
+default_hooks = dict(
+ ...
+ visualization=dict(type='VisualizationHook', enable=False),
+ ...
+)
+```
+
+This hook selects some images from the validation dataset and tags the prediction results on these images
+during every validation process. You can use it to watch how the model performance on actual images varies
+during training.
+
+In addition, if the images in your validation dataset are small (\<100 pixels), you can rescale them before
+visualization by specifying `rescale_factor=2.` or higher.
+
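+For example, a sketch that enables the hook and upscales small validation images:
+
+```python
+default_hooks = dict(
+    visualization=dict(type='VisualizationHook', enable=True, rescale_factor=2.),
+)
+```
+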
+## Visualizer
+
+The visualizer is used to record all kinds of information during training and test, including logs, images and
+scalars. By default, the recorded information will be saved at the `vis_data` folder under the work directory.
+
+**Default settings:**
+
+```python
+visualizer = dict(
+ type='UniversalVisualizer',
+ vis_backends=[
+ dict(type='LocalVisBackend'),
+ ]
+)
+```
+
+Usually, the most useful function is to save the log and scalars like `loss` to different backends.
+For example, to save them to TensorBoard, simply set them as below:
+
+```python
+visualizer = dict(
+ type='UniversalVisualizer',
+ vis_backends=[
+ dict(type='LocalVisBackend'),
+ dict(type='TensorboardVisBackend'),
+ ]
+)
+```
+
+Or save them to WandB as below:
+
+```python
+visualizer = dict(
+ type='UniversalVisualizer',
+ vis_backends=[
+ dict(type='LocalVisBackend'),
+ dict(type='WandbVisBackend'),
+ ]
+)
+```
+
+## Environment Configuration
+
+In the `env_cfg` field, you can configure some low-level parameters, like cuDNN, multi-process, and distributed
+communication.
+
+**Please make sure you understand the meaning of these parameters before modifying them.**
+
+```python
+env_cfg = dict(
+ # whether to enable cudnn benchmark
+ cudnn_benchmark=False,
+
+ # set multi-process parameters
+ mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
+
+ # set distributed parameters
+ dist_cfg=dict(backend='nccl'),
+)
+```
diff --git a/docs/en/advanced_guides/schedule.md b/docs/en/advanced_guides/schedule.md
new file mode 100644
index 0000000000000000000000000000000000000000..f02075924d2a38de7c65c23e3377c793cec7ff4f
--- /dev/null
+++ b/docs/en/advanced_guides/schedule.md
@@ -0,0 +1,361 @@
+# Customize Training Schedule
+
+In our codebase, [default training schedules](https://github.com/open-mmlab/mmpretrain/blob/main/configs/_base_/schedules) have been provided for common datasets such as CIFAR, ImageNet, etc. If we attempt to experiment on these datasets for higher accuracy, or on new methods and datasets, we might need to modify the strategies.
+
+In this tutorial, we will introduce how to modify configs to construct optimizers, use parameter-wise configuration, gradient clipping and gradient accumulation, as well as customize learning rate and momentum schedules. Furthermore, we introduce a template to customize self-implemented optimization methods for the project.
+
+## Customize optimization
+
+We use the `optim_wrapper` field to configure the optimization strategies, including the choice of optimizer, automatic mixed precision training, parameter-wise configurations, gradient clipping and gradient accumulation. Details are given below.
+
+### Use optimizers supported by PyTorch
+
+We support all the optimizers implemented by PyTorch, and to use them, please change the `optimizer` field of config files.
+
+For example, if you want to use [`SGD`](torch.optim.SGD), the modification in the config file could be as follows. Notice that optimization-related settings should all be wrapped inside `optim_wrapper`.
+
+```python
+optim_wrapper = dict(
+ type='OptimWrapper',
+ optimizer=dict(type='SGD', lr=0.0003, weight_decay=0.0001)
+)
+```
+
+```{note}
+The `type` in the optimizer config is not a constructor but an optimizer name in PyTorch.
+Refers to {external+torch:ref}`List of optimizers supported by PyTorch ` for more choices.
+```
+
+To modify the learning rate of the model, just modify the `lr` in the config of optimizer.
+You can also directly set other arguments according to the [API doc](torch.optim) of PyTorch.
+
+For example, if you want to use [`Adam`](torch.optim.Adam) with settings like `torch.optim.Adam(params, lr=0.001, betas=(0.9, 0.999), eps=1e-08, weight_decay=0, amsgrad=False)` in PyTorch, you could use the config below:
+
+```python
+optim_wrapper = dict(
+ type='OptimWrapper',
+ optimizer = dict(
+ type='Adam',
+ lr=0.001,
+ betas=(0.9, 0.999),
+ eps=1e-08,
+ weight_decay=0,
+ amsgrad=False),
+)
+```
+
+````{note}
+The default type of the `optim_wrapper` field is [`OptimWrapper`](mmengine.optim.OptimWrapper); therefore, you can
+usually omit the `type` field, like:
+
+```python
+optim_wrapper = dict(
+ optimizer=dict(
+ type='Adam',
+ lr=0.001,
+ betas=(0.9, 0.999),
+ eps=1e-08,
+ weight_decay=0,
+ amsgrad=False))
+```
+````
+
+### Use AMP training
+
+If we want to use the automatic mixed precision training, we can simply change the type of `optim_wrapper` to `AmpOptimWrapper` in config files.
+
+```python
+optim_wrapper = dict(type='AmpOptimWrapper', optimizer=...)
+```
+
+Alternatively, for convenience, we can set the `--amp` parameter to turn on the AMP option directly in the `tools/train.py` script. Refer to the [Training tutorial](../user_guides/train.md) for details of starting a training.
+
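+For example, a command-line sketch (the config path is illustrative):
+
+```bash
+python tools/train.py configs/resnet/resnet50_8xb32_in1k.py --amp
+```
+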
+### Parameter-wise finely configuration
+
+Some models may have parameter-specific optimization settings, for example, no weight decay for the BatchNorm layers, or different learning rates for different network layers.
+To configure these finely, we can use the `paramwise_cfg` argument in `optim_wrapper`.
+
+- **Set different hyper-parameter multipliers for different types of parameters.**
+
+ For instance, we can set `norm_decay_mult=0.` in `paramwise_cfg` to change the weight decay of weight and bias of normalization layers to zero.
+
+ ```python
+ optim_wrapper = dict(
+ optimizer=dict(type='SGD', lr=0.8, weight_decay=1e-4),
+ paramwise_cfg=dict(norm_decay_mult=0.))
+ ```
+
+  More types of parameters are supported, as listed below:
+
+  - `bias_lr_mult`: Multiplier for the learning rate of biases (excluding the biases of normalization layers and the offsets of deformable convolution layers). Defaults to 1.
+  - `bias_decay_mult`: Multiplier for the weight decay of biases (excluding the biases of normalization layers and the offsets of deformable convolution layers). Defaults to 1.
+ - `norm_decay_mult`: Multiplier for weight decay of weight and bias of normalization layers. Defaults to 1.
+ - `flat_decay_mult`: Multiplier for weight decay of all one-dimensional parameters. Defaults to 1.
+ - `dwconv_decay_mult`: Multiplier for weight decay of depth-wise convolution layers. Defaults to 1.
+ - `bypass_duplicate`: Whether to bypass duplicated parameters. Defaults to `False`.
+ - `dcn_offset_lr_mult`: Multiplier for learning rate of deformable convolution layers. Defaults to 1.
+
+- **Set different hyper-parameter multipliers for specific parameters.**
+
+  MMPretrain can use `custom_keys` in `paramwise_cfg` to specify different learning rates or weight decay for specific parameters.
+
+  For example, to set all learning rates and weight decays of `backbone.layer0` to 0, keep the rest of `backbone` the same as the optimizer, and set the learning rate of `head` to 0.001, use the config below.
+
+ ```python
+ optim_wrapper = dict(
+ optimizer=dict(type='SGD', lr=0.01, weight_decay=0.0001),
+ paramwise_cfg=dict(
+ custom_keys={
+ 'backbone.layer0': dict(lr_mult=0, decay_mult=0),
+ 'backbone': dict(lr_mult=1),
+ 'head': dict(lr_mult=0.1)
+ }))
+ ```
+
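+As promised above, here is a combined sketch that uses both styles of `paramwise_cfg` together. The multiplier values are arbitrary and only for illustration, not recommended settings:
+
+```python
+optim_wrapper = dict(
+    optimizer=dict(type='SGD', lr=0.1, momentum=0.9, weight_decay=1e-4),
+    paramwise_cfg=dict(
+        # no weight decay for normalization layers and other 1-d parameters
+        norm_decay_mult=0.,
+        flat_decay_mult=0.,
+        # parameters matching a custom key use these multipliers instead
+        custom_keys={
+            'backbone.layer0': dict(lr_mult=0.1),
+            'head': dict(lr_mult=10.),
+        }))
+```
+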
+### Gradient clipping
+
+During training, the loss function may approach a cliff-like region of the loss surface and cause a gradient explosion. Gradient clipping helps stabilize the training process. More introduction can be found on [this page](https://paperswithcode.com/method/gradient-clipping).
+
+Currently we support the `clip_grad` option in `optim_wrapper` for gradient clipping; refer to the [PyTorch documentation](torch.nn.utils.clip_grad_norm_).
+
+Here is an example:
+
+```python
+optim_wrapper = dict(
+ optimizer=dict(type='SGD', lr=0.01, weight_decay=0.0001),
+ # norm_type: type of the used p-norm, here norm_type is 2.
+ clip_grad=dict(max_norm=35, norm_type=2))
+```
+
+### Gradient accumulation
+
+When computing resources are limited, the batch size can only be set to a small value, which may affect the performance of models. Gradient accumulation can be used to work around this problem. We support the `accumulative_counts` option in `optim_wrapper` for gradient accumulation.
+
+Here is an example:
+
+```python
+train_dataloader = dict(batch_size=64)
+optim_wrapper = dict(
+ optimizer=dict(type='SGD', lr=0.01, weight_decay=0.0001),
+ accumulative_counts=4)
+```
+
+This indicates that, during training, the optimizer performs a parameter update every 4 iterations while gradients are accumulated in between. The above is roughly equivalent to:
+
+```python
+train_dataloader = dict(batch_size=256)
+optim_wrapper = dict(
+ optimizer=dict(type='SGD', lr=0.01, weight_decay=0.0001))
+```
+
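+Putting the options in this section together, the following sketch (with arbitrary example values) enables AMP, gradient clipping and gradient accumulation at the same time:
+
+```python
+optim_wrapper = dict(
+    type='AmpOptimWrapper',  # automatic mixed precision training
+    optimizer=dict(type='SGD', lr=0.01, momentum=0.9, weight_decay=0.0001),
+    # clip gradients by their 2-norm
+    clip_grad=dict(max_norm=35, norm_type=2),
+    # update parameters every 4 iterations
+    accumulative_counts=4)
+```
+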
+## Customize parameter schedules
+
+In training, optimization parameters such as the learning rate and momentum are usually not fixed but change over iterations or epochs. PyTorch supports several learning rate schedulers, but they are not sufficient for complex strategies. In MMPretrain, we provide `param_scheduler` for better control of different parameter schedules.
+
+### Customize learning rate schedules
+
+Learning rate schedulers are widely used to improve performance. We support most of the PyTorch schedulers, including `ExponentialLR`, `LinearLR`, `StepLR`, `MultiStepLR`, etc.
+
+All available learning rate schedulers can be found {external+mmengine:doc}`here `, and the
+names of learning rate schedulers all end with `LR`.
+
+- **Single learning rate schedule**
+
+ In most cases, we use only one learning rate schedule for simplicity. For instance, [`MultiStepLR`](mmengine.optim.MultiStepLR) is used as the default learning rate schedule for ResNet. Here, `param_scheduler` is a dictionary.
+
+ ```python
+ param_scheduler = dict(
+ type='MultiStepLR',
+ by_epoch=True,
+ milestones=[100, 150],
+ gamma=0.1)
+ ```
+
+  Or, if we want to use the [`CosineAnnealingLR`](mmengine.optim.CosineAnnealingLR) scheduler to decay the learning rate:
+
+ ```python
+ param_scheduler = dict(
+ type='CosineAnnealingLR',
+ by_epoch=True,
+ T_max=num_epochs)
+ ```
+
+- **Multiple learning rate schedules**
+
+  In some training cases, multiple learning rate schedules are applied for higher accuracy. For example, in the early stage, training is prone to volatility, and warmup is a technique to reduce it.
+  The learning rate increases gradually from a small value to the expected value during warmup and decays afterwards according to other schedules.
+
+  In MMPretrain, simply combining the desired schedules into a list in `param_scheduler` achieves the warmup strategy.
+
+ Here are some examples:
+
+ 1. linear warmup during the first 50 iters.
+
+ ```python
+ param_scheduler = [
+ # linear warm-up by iters
+ dict(type='LinearLR',
+ start_factor=0.001,
+ by_epoch=False, # by iters
+ end=50), # only warm up for first 50 iters
+      # main learning rate schedule
+ dict(type='MultiStepLR',
+ by_epoch=True,
+ milestones=[8, 11],
+ gamma=0.1)
+ ]
+ ```
+
+ 2. linear warmup and update lr by iter during the first 10 epochs.
+
+ ```python
+ param_scheduler = [
+ # linear warm-up by epochs in [0, 10) epochs
+ dict(type='LinearLR',
+ start_factor=0.001,
+ by_epoch=True,
+ end=10,
+ convert_to_iter_based=True, # Update learning rate by iter.
+ ),
+ # use CosineAnnealing schedule after 10 epochs
+ dict(type='CosineAnnealingLR', by_epoch=True, begin=10)
+ ]
+ ```
+
+  Notice that we use the `begin` and `end` arguments here to assign the valid range, which is [`begin`, `end`) for this schedule, and the range unit is defined by the `by_epoch` argument. If not specified, `begin` defaults to 0 and `end` defaults to the maximum number of epochs or iterations.
+
+  If the ranges of the schedules are not continuous, the learning rate stays constant in the ignored ranges; otherwise, all valid schedulers are executed in order within a specific stage, which behaves the same as the PyTorch [`ChainedScheduler`](torch.optim.lr_scheduler.ChainedScheduler).
+
+ ```{tip}
+  To check that the learning rate curve is as expected, after completing your configuration file, you can use the [optimizer parameter visualization tool](../useful_tools/scheduler_visualization.md) to draw the corresponding learning rate adjustment curve.
+ ```
+
+### Customize momentum schedules
+
+We support using momentum schedulers to modify the optimizer's momentum according to the learning rate, which could make the loss converge faster. The usage is the same as for learning rate schedulers.
+
+All available momentum schedulers can be found {external+mmengine:doc}`here `, and their
+names all end with `Momentum`.
+
+Here is an example:
+
+```python
+param_scheduler = [
+ # the lr scheduler
+ dict(type='LinearLR', ...),
+ # the momentum scheduler
+ dict(type='LinearMomentum',
+ start_factor=0.001,
+ by_epoch=False,
+ begin=0,
+ end=1000)
+]
+```
+
+## Add new optimizers or constructors
+
+```{note}
+This part will modify the MMPretrain source code or add code to the MMPretrain framework. Beginners can skip it.
+```
+
+### Add new optimizers
+
+In academic research and industrial practice, it may be necessary to use optimization methods not implemented by MMPretrain. You can add them through the following steps.
+
+1. Implement a New Optimizer
+
+   Assume you want to add an optimizer named `MyOptimizer`, which has arguments `a`, `b`, and `c`.
+   You need to create a new file under `mmpretrain/engine/optimizers`, and implement the new optimizer in it, for example, in `mmpretrain/engine/optimizers/my_optimizer.py` (a complete toy example is given at the end of this subsection):
+
+ ```python
+ from torch.optim import Optimizer
+ from mmpretrain.registry import OPTIMIZERS
+
+
+ @OPTIMIZERS.register_module()
+ class MyOptimizer(Optimizer):
+
+ def __init__(self, a, b, c):
+ ...
+
+ def step(self, closure=None):
+ ...
+ ```
+
+2. Import the Optimizer
+
+   To be found by the registry, the module defined above must be imported when the program runs.
+
+ Import it in the `mmpretrain/engine/optimizers/__init__.py` to add it into the `mmpretrain.engine` package.
+
+ ```python
+ # In mmpretrain/engine/optimizers/__init__.py
+ ...
+   from .my_optimizer import MyOptimizer  # MyOptimizer may be any other class name
+
+ __all__ = [..., 'MyOptimizer']
+ ```
+
+   At runtime, the `mmpretrain.engine` package is imported automatically and `MyOptimizer` is registered at the same time.
+
+3. Specify the Optimizer in Config
+
+ Then you can use `MyOptimizer` in the `optim_wrapper.optimizer` field of config files.
+
+ ```python
+ optim_wrapper = dict(
+ optimizer=dict(type='MyOptimizer', a=a_value, b=b_value, c=c_value))
+ ```
+
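+The skeleton in step 1 leaves the implementation bodies out. As a self-contained toy example (not part of MMPretrain), a complete optimizer performing a plain SGD-style update with a single `lr` hyper-parameter could look like this:
+
+```python
+import torch
+from torch.optim import Optimizer
+
+from mmpretrain.registry import OPTIMIZERS
+
+
+@OPTIMIZERS.register_module()
+class ToySGD(Optimizer):
+    """A toy optimizer that performs plain gradient descent."""
+
+    def __init__(self, params, lr=0.01):
+        defaults = dict(lr=lr)
+        super().__init__(params, defaults)
+
+    @torch.no_grad()
+    def step(self, closure=None):
+        loss = None
+        if closure is not None:
+            with torch.enable_grad():
+                loss = closure()
+        for group in self.param_groups:
+            for p in group['params']:
+                if p.grad is None:
+                    continue
+                # p <- p - lr * grad
+                p.add_(p.grad, alpha=-group['lr'])
+        return loss
+```
+
+After importing it as described in step 2, it could be referenced in configs with `optimizer=dict(type='ToySGD', lr=0.01)`.
+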
+### Add new optimizer constructors
+
+Some models may have parameter-specific optimization settings, like a different weight decay rate for all `BatchNorm` layers.
+
+Although we can already use [the `optim_wrapper.paramwise_cfg` field](#parameter-wise-fine-grained-configuration) to
+configure various parameter-specific optimizer settings, it may still not cover your needs.
+
+In that case, you can modify the behavior. By default, we use the [`DefaultOptimWrapperConstructor`](mmengine.optim.DefaultOptimWrapperConstructor)
+class to construct the optimizer. During the construction, it finely configures the optimizer settings of
+different parameters according to `paramwise_cfg`, and it can also serve as a template for new optimizer constructors.
+
+You can overwrite these behaviors by adding new optimizer constructors.
+
+```python
+# In mmpretrain/engine/optimizers/my_optim_constructor.py
+from mmengine.optim import DefaultOptimWrapperConstructor
+from mmpretrain.registry import OPTIM_WRAPPER_CONSTRUCTORS
+
+
+@OPTIM_WRAPPER_CONSTRUCTORS.register_module()
+class MyOptimWrapperConstructor:
+
+ def __init__(self, optim_wrapper_cfg, paramwise_cfg=None):
+ ...
+
+ def __call__(self, model):
+ ...
+```
+
+Here is a specific example of [OptimWrapperConstructor](mmpretrain.engine.optimizers.LearningRateDecayOptimWrapperConstructor).
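+
+As a rough, self-contained sketch (not part of MMPretrain), the toy constructor below puts every bias parameter into a group without weight decay; for simplicity it ignores `paramwise_cfg` and any extra wrapper options such as `clip_grad`:
+
+```python
+from mmengine.optim import OptimWrapper
+
+from mmpretrain.registry import OPTIM_WRAPPER_CONSTRUCTORS, OPTIMIZERS
+
+
+@OPTIM_WRAPPER_CONSTRUCTORS.register_module()
+class NoBiasDecayConstructor:
+
+    def __init__(self, optim_wrapper_cfg, paramwise_cfg=None):
+        self.optim_wrapper_cfg = optim_wrapper_cfg
+        self.paramwise_cfg = paramwise_cfg or {}
+
+    def __call__(self, model):
+        decay, no_decay = [], []
+        for name, param in model.named_parameters():
+            if not param.requires_grad:
+                continue
+            (no_decay if name.endswith('.bias') else decay).append(param)
+
+        optimizer_cfg = self.optim_wrapper_cfg['optimizer'].copy()
+        optimizer_cfg['params'] = [
+            dict(params=decay),
+            dict(params=no_decay, weight_decay=0.),
+        ]
+        optimizer = OPTIMIZERS.build(optimizer_cfg)
+        # A plain OptimWrapper is returned here; other fields of
+        # `optim_wrapper_cfg` are ignored in this sketch.
+        return OptimWrapper(optimizer=optimizer)
+```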
+
+Then, import it and use it almost in the same way as in [the optimizer tutorial](#add-new-optimizers).
+
+1. Import it in the `mmpretrain/engine/optimizers/__init__.py` to add it into the `mmpretrain.engine` package.
+
+ ```python
+ # In mmpretrain/engine/optimizers/__init__.py
+ ...
+ from .my_optim_constructor import MyOptimWrapperConstructor
+
+ __all__ = [..., 'MyOptimWrapperConstructor']
+ ```
+
+2. Use `MyOptimWrapperConstructor` in the `optim_wrapper.constructor` field of config files.
+
+ ```python
+ optim_wrapper = dict(
+ constructor=dict(type='MyOptimWrapperConstructor'),
+ optimizer=...,
+ paramwise_cfg=...,
+ )
+ ```
diff --git a/docs/en/api/apis.rst b/docs/en/api/apis.rst
new file mode 100644
index 0000000000000000000000000000000000000000..074960b6c313b63ff6bb2e98ef85a526a057ad15
--- /dev/null
+++ b/docs/en/api/apis.rst
@@ -0,0 +1,48 @@
+.. role:: hidden
+ :class: hidden-section
+
+.. module:: mmpretrain.apis
+
+mmpretrain.apis
+===================================
+
+These are some high-level APIs for classification tasks.
+
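+As a quick, illustrative sketch (mirroring the get-started guide; the model name is just an example):
+
+.. code:: python
+
+    from mmpretrain.apis import get_model, inference_model, list_models
+
+    # list ImageNet-1k ResNet models known to MMPreTrain
+    print(list_models('resnet*_in1k'))
+
+    # build a model with pre-trained weights and run inference on one image
+    model = get_model('resnet18_8xb32_in1k', pretrained=True)
+    result = inference_model(model, 'demo/demo.JPEG')
+    print(result['pred_class'])
+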
+.. contents:: mmpretrain.apis
+ :depth: 2
+ :local:
+ :backlinks: top
+
+Model
+------------------
+
+.. autosummary::
+ :toctree: generated
+ :nosignatures:
+
+ list_models
+ get_model
+
+Inference
+------------------
+
+.. autosummary::
+ :toctree: generated
+ :nosignatures:
+ :template: callable.rst
+
+ ImageClassificationInferencer
+ ImageRetrievalInferencer
+ ImageCaptionInferencer
+ VisualQuestionAnsweringInferencer
+ VisualGroundingInferencer
+ TextToImageRetrievalInferencer
+ ImageToTextRetrievalInferencer
+ NLVRInferencer
+ FeatureExtractor
+
+.. autosummary::
+ :toctree: generated
+ :nosignatures:
+
+ inference_model
diff --git a/docs/en/api/data_process.rst b/docs/en/api/data_process.rst
new file mode 100644
index 0000000000000000000000000000000000000000..af0f6e54ec2b76d61fd504abb79806d610329444
--- /dev/null
+++ b/docs/en/api/data_process.rst
@@ -0,0 +1,329 @@
+.. role:: hidden
+ :class: hidden-section
+
+Data Process
+=================
+
+In MMPreTrain, the data process and the dataset are decoupled. The
+datasets only define how to get samples' basic information from the file
+system. This basic information includes the ground-truth label and the raw
+image data or the paths of images. The data process includes data transforms,
+data preprocessors and batch augmentations.
+
+- :mod:`Data Transforms `: Transforms include loading, preprocessing, formatting, etc.
+- :mod:`Data Preprocessors `: Processes include collating, normalization, stacking, channel flipping, etc.
+
+ - :mod:`Batch Augmentations `: Batch augmentation involves multiple samples, such as Mixup and CutMix.
+
+.. module:: mmpretrain.datasets.transforms
+
+Data Transforms
+--------------------
+
+To prepare the input data, we need to do some transforms on this basic
+information. These transforms include loading, preprocessing and
+formatting, and a series of data transforms makes up a data pipeline.
+Therefore, you can find a ``pipeline`` argument in the configs of the dataset,
+for example:
+
+.. code:: python
+
+ train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='RandomResizedCrop', scale=224),
+ dict(type='RandomFlip', prob=0.5, direction='horizontal'),
+ dict(type='PackInputs'),
+ ]
+
+ train_dataloader = dict(
+ ....
+ dataset=dict(
+ pipeline=train_pipeline,
+ ....),
+ ....
+ )
+
+Every item of a pipeline list is one of the following data transform classes. And if you want to add a custom data transformation class, the tutorial :doc:`Custom Data Pipelines ` will help you.
+
+.. contents::
+ :depth: 1
+ :local:
+ :backlinks: top
+
+Loading and Formatting
+^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+.. autosummary::
+ :toctree: generated
+ :nosignatures:
+ :template: data_transform.rst
+
+ LoadImageFromFile
+ PackInputs
+ PackMultiTaskInputs
+ PILToNumpy
+ NumpyToPIL
+ Transpose
+ Collect
+
+Processing and Augmentation
+^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+.. autosummary::
+ :toctree: generated
+ :nosignatures:
+ :template: data_transform.rst
+
+ Albumentations
+ CenterCrop
+ ColorJitter
+ EfficientNetCenterCrop
+ EfficientNetRandomCrop
+ Lighting
+ Normalize
+ RandomCrop
+ RandomErasing
+ RandomFlip
+ RandomGrayscale
+ RandomResize
+ RandomResizedCrop
+ Resize
+ ResizeEdge
+ BEiTMaskGenerator
+ SimMIMMaskGenerator
+
+Composed Augmentation
+"""""""""""""""""""""
+Composed augmentations are methods which compose a series of data
+augmentation transforms, such as ``AutoAugment`` and ``RandAugment``.
+
+.. autosummary::
+ :toctree: generated
+ :nosignatures:
+ :template: data_transform.rst
+
+ AutoAugment
+ RandAugment
+
+The above transforms are composed from a group of policies based on the random
+transforms below:
+
+.. autosummary::
+ :toctree: generated
+ :nosignatures:
+ :template: data_transform.rst
+
+ AutoContrast
+ Brightness
+ ColorTransform
+ Contrast
+ Cutout
+ Equalize
+ GaussianBlur
+ Invert
+ Posterize
+ Rotate
+ Sharpness
+ Shear
+ Solarize
+ SolarizeAdd
+ Translate
+ BaseAugTransform
+
+MMCV transforms
+^^^^^^^^^^^^^^^
+
+Many transforms in MMCV can also be used directly in the config files. The whole transform list can be found in :external+mmcv:doc:`api/transforms`.
+
+Transform Wrapper
+^^^^^^^^^^^^^^^^^
+
+.. autosummary::
+ :toctree: generated
+ :nosignatures:
+ :template: data_transform.rst
+
+ MultiView
+
+.. module:: mmpretrain.models.utils.data_preprocessor
+
+
+TorchVision Transforms
+^^^^^^^^^^^^^^^^^^^^^^
+
+We also provide all the transforms in TorchVision. You can use them like the following examples:
+
+**1. Use some TorchVision Augs Surrounded by NumpyToPIL and PILToNumpy (Recommended)**
+
+Add TorchVision Augs surrounded by ``dict(type='NumpyToPIL', to_rgb=True),`` and ``dict(type='PILToNumpy', to_bgr=True),``
+
+.. code:: python
+
+ train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='NumpyToPIL', to_rgb=True), # from BGR in cv2 to RGB in PIL
+ dict(type='torchvision/RandomResizedCrop',size=176),
+ dict(type='PILToNumpy', to_bgr=True), # from RGB in PIL to BGR in cv2
+ dict(type='RandomFlip', prob=0.5, direction='horizontal'),
+ dict(type='PackInputs'),
+ ]
+
+ data_preprocessor = dict(
+ num_classes=1000,
+ mean=[123.675, 116.28, 103.53],
+ std=[58.395, 57.12, 57.375],
+ to_rgb=True, # from BGR in cv2 to RGB in PIL
+ )
+
+
+**2. Use TorchVision Augs and ToTensor&Normalize**
+
+Make sure the 'img' has been converted to PIL format from BGR-Numpy format before being processed by TorchVision Augs.
+
+.. code:: python
+
+ train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='NumpyToPIL', to_rgb=True), # from BGR in cv2 to RGB in PIL
+ dict(
+ type='torchvision/RandomResizedCrop',
+ size=176,
+ interpolation='bilinear'), # accept str format interpolation mode
+ dict(type='torchvision/RandomHorizontalFlip', p=0.5),
+ dict(
+ type='torchvision/TrivialAugmentWide',
+ interpolation='bilinear'),
+ dict(type='torchvision/PILToTensor'),
+ dict(type='torchvision/ConvertImageDtype', dtype=torch.float),
+ dict(
+ type='torchvision/Normalize',
+ mean=(0.485, 0.456, 0.406),
+ std=(0.229, 0.224, 0.225),
+ ),
+ dict(type='torchvision/RandomErasing', p=0.1),
+ dict(type='PackInputs'),
+ ]
+
+ data_preprocessor = dict(num_classes=1000, mean=None, std=None, to_rgb=False) # Normalize in dataset pipeline
+
+
+**3. Use TorchVision Augs Except ToTensor&Normalize**
+
+.. code:: python
+
+ train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='NumpyToPIL', to_rgb=True), # from BGR in cv2 to RGB in PIL
+ dict(type='torchvision/RandomResizedCrop', size=176, interpolation='bilinear'),
+ dict(type='torchvision/RandomHorizontalFlip', p=0.5),
+ dict(type='torchvision/TrivialAugmentWide', interpolation='bilinear'),
+ dict(type='PackInputs'),
+ ]
+
+ # here the Normalize params is for the RGB format
+ data_preprocessor = dict(
+ num_classes=1000,
+ mean=[123.675, 116.28, 103.53],
+ std=[58.395, 57.12, 57.375],
+ to_rgb=False,
+ )
+
+
+Data Preprocessors
+------------------
+
+The data preprocessor is also a component that processes the data before feeding it to the neural network.
+Compared with the data transforms, the data preprocessor is a module of the classifier,
+and it processes a batch of data at a time, which means it can use the GPU and batching to accelerate the processing.
+
+The default data preprocessor in MMPreTrain can do the following pre-processing:
+
+1. Move data to the target device.
+2. Pad inputs to the maximum size of current batch.
+3. Stack inputs to a batch.
+4. Convert inputs from bgr to rgb if the shape of input is (3, H, W).
+5. Normalize image with defined std and mean.
+6. Do batch augmentations like Mixup and CutMix during training.
+
+You can configure the data preprocessor by the ``data_preprocessor`` field or ``model.data_preprocessor`` field in the config file. Typical usages are as below:
+
+.. code-block:: python
+
+ data_preprocessor = dict(
+ # RGB format normalization parameters
+ mean=[123.675, 116.28, 103.53],
+ std=[58.395, 57.12, 57.375],
+ to_rgb=True, # convert image from BGR to RGB
+ )
+
+Or define in ``model.data_preprocessor`` as following:
+
+.. code-block:: python
+
+ model = dict(
+ backbone = ...,
+ neck = ...,
+ head = ...,
+ data_preprocessor = dict(
+ mean=[123.675, 116.28, 103.53],
+ std=[58.395, 57.12, 57.375],
+            to_rgb=True),
+ train_cfg=...,
+ )
+
+Note that the ``model.data_preprocessor`` has higher priority than ``data_preprocessor``.
+
+.. autosummary::
+ :toctree: generated
+ :nosignatures:
+
+ ClsDataPreprocessor
+ SelfSupDataPreprocessor
+ TwoNormDataPreprocessor
+ VideoDataPreprocessor
+
+.. module:: mmpretrain.models.utils.batch_augments
+
+Batch Augmentations
+^^^^^^^^^^^^^^^^^^^^
+
+Batch augmentation is a component of the data preprocessors. It involves multiple samples and mixes them in some way, such as Mixup and CutMix.
+
+These augmentations are usually only used during training, therefore, we use the ``model.train_cfg`` field to configure them in config files.
+
+.. code-block:: python
+
+ model = dict(
+ backbone=...,
+ neck=...,
+ head=...,
+ train_cfg=dict(augments=[
+ dict(type='Mixup', alpha=0.8),
+ dict(type='CutMix', alpha=1.0),
+ ]),
+ )
+
+You can also specify the probabilities of every batch augmentation by the ``probs`` field.
+
+.. code-block:: python
+
+ model = dict(
+ backbone=...,
+ neck=...,
+ head=...,
+ train_cfg=dict(augments=[
+ dict(type='Mixup', alpha=0.8),
+ dict(type='CutMix', alpha=1.0),
+ ], probs=[0.3, 0.7])
+ )
+
+Here is a list of batch augmentations that can be used in MMPreTrain.
+
+.. autosummary::
+ :toctree: generated
+ :nosignatures:
+ :template: callable.rst
+
+ Mixup
+ CutMix
+ ResizeMix
diff --git a/docs/en/api/datasets.rst b/docs/en/api/datasets.rst
new file mode 100644
index 0000000000000000000000000000000000000000..069880dd722457225c864639600aa5e0ff54f6ff
--- /dev/null
+++ b/docs/en/api/datasets.rst
@@ -0,0 +1,129 @@
+.. role:: hidden
+ :class: hidden-section
+
+.. module:: mmpretrain.datasets
+
+mmpretrain.datasets
+===================================
+
+The ``datasets`` package contains several commonly used datasets for image classification tasks and some dataset wrappers.
+
+.. contents:: mmpretrain.datasets
+ :depth: 2
+ :local:
+ :backlinks: top
+
+Custom Dataset
+--------------
+
+.. autoclass:: CustomDataset
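+
+An incomplete, illustrative config snippet (the ``data_root`` path and pipeline are placeholders):
+
+.. code:: python
+
+    train_dataloader = dict(
+        batch_size=32,
+        dataset=dict(
+            type='CustomDataset',
+            data_root='data/my_dataset/train',  # hypothetical path
+            pipeline=[...],                     # your data pipeline
+        ),
+    )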
+
+ImageNet
+--------
+
+.. autoclass:: ImageNet
+
+.. autoclass:: ImageNet21k
+
+CIFAR
+-----
+
+.. autoclass:: CIFAR10
+
+.. autoclass:: CIFAR100
+
+MNIST
+-----
+
+.. autoclass:: MNIST
+
+.. autoclass:: FashionMNIST
+
+VOC
+---
+
+.. autoclass:: VOC
+
+CUB
+---
+
+.. autoclass:: CUB
+
+Places205
+---------
+
+.. autoclass:: Places205
+
+Retrieval
+---------
+
+.. autoclass:: InShop
+
+Base classes
+------------
+
+.. autoclass:: BaseDataset
+
+.. autoclass:: MultiLabelDataset
+
+Caltech101
+----------------
+
+.. autoclass:: Caltech101
+
+Food101
+----------------
+
+.. autoclass:: Food101
+
+DTD
+----------------
+
+.. autoclass:: DTD
+
+FGVCAircraft
+----------------
+
+.. autoclass:: FGVCAircraft
+
+
+Flowers102
+----------------
+
+.. autoclass:: Flowers102
+
+StanfordCars
+----------------
+
+.. autoclass:: StanfordCars
+
+OxfordIIITPet
+----------------
+
+.. autoclass:: OxfordIIITPet
+
+SUN397
+----------------
+
+.. autoclass:: SUN397
+
+RefCOCO
+--------
+
+.. autoclass:: RefCOCO
+
+Dataset Wrappers
+----------------
+
+.. autoclass:: KFoldDataset
+
+The dataset wrappers in MMEngine can be directly used in MMPreTrain.
+
+.. list-table::
+
+ * - :class:`~mmengine.dataset.ConcatDataset`
+ - A wrapper of concatenated dataset.
+ * - :class:`~mmengine.dataset.RepeatDataset`
+ - A wrapper of repeated dataset.
+ * - :class:`~mmengine.dataset.ClassBalancedDataset`
+ - A wrapper of class balanced dataset.
diff --git a/docs/en/api/engine.rst b/docs/en/api/engine.rst
new file mode 100644
index 0000000000000000000000000000000000000000..2e67fd064058dae19a188efd4e2f513b13ba63c6
--- /dev/null
+++ b/docs/en/api/engine.rst
@@ -0,0 +1,51 @@
+.. role:: hidden
+ :class: hidden-section
+
+.. module:: mmpretrain.engine
+
+mmpretrain.engine
+===================================
+
+This package includes some runtime components, such as hooks, runners, optimizers and loops. These components are useful in
+classification tasks but are not yet supported by MMEngine.
+
+.. note::
+
+ Some components may be moved to MMEngine in the future.
+
+.. contents:: mmpretrain.engine
+ :depth: 2
+ :local:
+ :backlinks: top
+
+.. module:: mmpretrain.engine.hooks
+
+Hooks
+------------------
+
+.. autosummary::
+ :toctree: generated
+ :nosignatures:
+
+ ClassNumCheckHook
+ PreciseBNHook
+ VisualizationHook
+ PrepareProtoBeforeValLoopHook
+ SetAdaptiveMarginsHook
+ EMAHook
+ SimSiamHook
+ DenseCLHook
+ SwAVHook
+
+.. module:: mmpretrain.engine.optimizers
+
+Optimizers
+------------------
+
+.. autosummary::
+ :toctree: generated
+ :nosignatures:
+
+ Lamb
+ LARS
+ LearningRateDecayOptimWrapperConstructor
diff --git a/docs/en/api/evaluation.rst b/docs/en/api/evaluation.rst
new file mode 100644
index 0000000000000000000000000000000000000000..bddea207879dec23ce72efe68b682561836dcd92
--- /dev/null
+++ b/docs/en/api/evaluation.rst
@@ -0,0 +1,47 @@
+.. role:: hidden
+ :class: hidden-section
+
+.. module:: mmpretrain.evaluation
+
+mmpretrain.evaluation
+===================================
+
+This package includes metrics and evaluators for classification tasks.
+
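+For example, a typical classification config uses the :class:`Accuracy` metric as the evaluator. An illustrative snippet (not a complete config):
+
+.. code:: python
+
+    # report top-1 and top-5 accuracy on the validation and test sets
+    val_evaluator = dict(type='Accuracy', topk=(1, 5))
+    test_evaluator = val_evaluator
+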
+.. contents:: mmpretrain.evaluation
+ :depth: 1
+ :local:
+ :backlinks: top
+
+Single Label Metric
+----------------------
+
+.. autosummary::
+ :toctree: generated
+ :nosignatures:
+
+ Accuracy
+ SingleLabelMetric
+ ConfusionMatrix
+
+Multi Label Metric
+----------------------
+.. autosummary::
+ :toctree: generated
+ :nosignatures:
+
+ AveragePrecision
+ MultiLabelMetric
+ VOCAveragePrecision
+ VOCMultiLabelMetric
+
+Retrieval Metric
+----------------------
+
+.. autosummary::
+ :toctree: generated
+ :nosignatures:
+ :template: classtemplate.rst
+
+ RetrievalRecall
+ RetrievalAveragePrecision
diff --git a/docs/en/api/models.rst b/docs/en/api/models.rst
new file mode 100644
index 0000000000000000000000000000000000000000..30980324a4fa0302806cfbb5c5dee903782b9757
--- /dev/null
+++ b/docs/en/api/models.rst
@@ -0,0 +1,364 @@
+.. role:: hidden
+ :class: hidden-section
+
+.. module:: mmpretrain.models
+
+mmpretrain.models
+===================================
+
+The ``models`` package contains several sub-packages for addressing the different components of a model.
+
+- :mod:`~mmpretrain.models.classifiers`: The top-level module which defines the whole process of a classification model.
+- :mod:`~mmpretrain.models.selfsup`: The top-level module which defines the whole process of a self-supervised learning model.
+- :mod:`~mmpretrain.models.retrievers`: The top-level module which defines the whole process of a retrieval model.
+- :mod:`~mmpretrain.models.backbones`: Usually a feature extraction network, e.g., ResNet, MobileNet.
+- :mod:`~mmpretrain.models.necks`: The component between backbones and heads, e.g., GlobalAveragePooling.
+- :mod:`~mmpretrain.models.heads`: The component for specific tasks.
+- :mod:`~mmpretrain.models.losses`: Loss functions.
+- :mod:`~mmpretrain.models.peft`: The PEFT (Parameter-Efficient Fine-Tuning) module, e.g. LoRAModel.
+- :mod:`~mmpretrain.models.utils`: Some helper functions and common components used in various networks.
+
+ - :mod:`~mmpretrain.models.utils.data_preprocessor`: The component before model to preprocess the inputs, e.g., ClsDataPreprocessor.
+ - :ref:`components`: Common components used in various networks.
+ - :ref:`helpers`: Helper functions.
+
+Build Functions
+---------------
+
+.. autosummary::
+ :toctree: generated
+ :nosignatures:
+
+ build_classifier
+ build_backbone
+ build_neck
+ build_head
+ build_loss
+
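+As an illustrative sketch, these build functions accept the same config dicts that are used in config files:
+
+.. code:: python
+
+    from mmpretrain.models import build_backbone
+
+    # build a ResNet-50 backbone from a config dict (illustrative settings)
+    backbone = build_backbone(dict(type='ResNet', depth=50))
+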
+.. module:: mmpretrain.models.classifiers
+
+Classifiers
+------------------
+
+.. autosummary::
+ :toctree: generated
+ :nosignatures:
+
+ BaseClassifier
+ ImageClassifier
+ TimmClassifier
+ HuggingFaceClassifier
+
+.. module:: mmpretrain.models.selfsup
+
+Self-supervised Algorithms
+--------------------------
+
+.. _selfsup_algorithms:
+
+.. autosummary::
+ :toctree: generated
+ :nosignatures:
+
+ BaseSelfSupervisor
+ BEiT
+ BYOL
+ BarlowTwins
+ CAE
+ DenseCL
+ EVA
+ iTPN
+ MAE
+ MILAN
+ MaskFeat
+ MixMIM
+ MoCo
+ MoCoV3
+ SimCLR
+ SimMIM
+ SimSiam
+ SparK
+ SwAV
+
+.. _selfsup_backbones:
+
+Some of the above algorithms modify the backbone module to adapt to extra inputs
+like ``mask``, and here is a list of these **modified backbone** modules.
+
+.. autosummary::
+ :toctree: generated
+ :nosignatures:
+
+ BEiTPretrainViT
+ CAEPretrainViT
+ iTPNHiViT
+ MAEHiViT
+ MAEViT
+ MILANViT
+ MaskFeatViT
+ MixMIMPretrainTransformer
+ MoCoV3ViT
+ SimMIMSwinTransformer
+
+.. _target_generators:
+
+Some self-supervised algorithms need an external **target generator** to
+generate the optimization target. Here is a list of target generators.
+
+.. autosummary::
+ :toctree: generated
+ :nosignatures:
+
+ VQKD
+ DALLEEncoder
+ HOGGenerator
+ CLIPGenerator
+
+.. module:: mmpretrain.models.retrievers
+
+Retrievers
+------------------
+
+.. autosummary::
+ :toctree: generated
+ :nosignatures:
+
+ BaseRetriever
+ ImageToImageRetriever
+
+.. module:: mmpretrain.models.multimodal
+
+Multi-Modality Algorithms
+--------------------------
+
+.. autosummary::
+ :toctree: generated
+ :nosignatures:
+
+ Blip2Caption
+ Blip2Retrieval
+ Blip2VQA
+ BlipCaption
+ BlipGrounding
+ BlipNLVR
+ BlipRetrieval
+ BlipVQA
+ Flamingo
+ OFA
+ MiniGPT4
+ Llava
+ Otter
+
+.. module:: mmpretrain.models.backbones
+
+Backbones
+------------------
+
+.. autosummary::
+ :toctree: generated
+ :nosignatures:
+
+ AlexNet
+ BEiTViT
+ CSPDarkNet
+ CSPNet
+ CSPResNeXt
+ CSPResNet
+ Conformer
+ ConvMixer
+ ConvNeXt
+ DaViT
+ DeiT3
+ DenseNet
+ DistilledVisionTransformer
+ EdgeNeXt
+ EfficientFormer
+ EfficientNet
+ EfficientNetV2
+ HiViT
+ HRNet
+ HorNet
+ InceptionV3
+ LeNet5
+ LeViT
+ MViT
+ MlpMixer
+ MobileNetV2
+ MobileNetV3
+ MobileOne
+ MobileViT
+ PCPVT
+ PoolFormer
+ PyramidVig
+ RegNet
+ RepLKNet
+ RepMLPNet
+ RepVGG
+ Res2Net
+ ResNeSt
+ ResNeXt
+ ResNet
+ ResNetV1c
+ ResNetV1d
+ ResNet_CIFAR
+ RevVisionTransformer
+ SEResNeXt
+ SEResNet
+ SVT
+ ShuffleNetV1
+ ShuffleNetV2
+ SparseResNet
+ SparseConvNeXt
+ SwinTransformer
+ SwinTransformerV2
+ T2T_ViT
+ TIMMBackbone
+ TNT
+ VAN
+ VGG
+ Vig
+ VisionTransformer
+ ViTSAM
+ XCiT
+ ViTEVA02
+
+.. module:: mmpretrain.models.necks
+
+Necks
+------------------
+
+.. autosummary::
+ :toctree: generated
+ :nosignatures:
+
+ BEiTV2Neck
+ CAENeck
+ ClsBatchNormNeck
+ DenseCLNeck
+ GeneralizedMeanPooling
+ GlobalAveragePooling
+ HRFuseScales
+ LinearNeck
+ MAEPretrainDecoder
+ MILANPretrainDecoder
+ MixMIMPretrainDecoder
+ MoCoV2Neck
+ NonLinearNeck
+ SimMIMLinearDecoder
+ SwAVNeck
+ iTPNPretrainDecoder
+ SparKLightDecoder
+
+.. module:: mmpretrain.models.heads
+
+Heads
+------------------
+
+.. autosummary::
+ :toctree: generated
+ :nosignatures:
+
+ ArcFaceClsHead
+ BEiTV1Head
+ BEiTV2Head
+ CAEHead
+ CSRAClsHead
+ ClsHead
+ ConformerHead
+ ContrastiveHead
+ DeiTClsHead
+ EfficientFormerClsHead
+ LatentCrossCorrelationHead
+ LatentPredictHead
+ LeViTClsHead
+ LinearClsHead
+ MAEPretrainHead
+ MIMHead
+ MixMIMPretrainHead
+ MoCoV3Head
+ MultiLabelClsHead
+ MultiLabelLinearClsHead
+ MultiTaskHead
+ SimMIMHead
+ StackedLinearClsHead
+ SwAVHead
+ VigClsHead
+ VisionTransformerClsHead
+ iTPNClipHead
+ SparKPretrainHead
+
+.. module:: mmpretrain.models.losses
+
+Losses
+------------------
+
+.. autosummary::
+ :toctree: generated
+ :nosignatures:
+
+ AsymmetricLoss
+ CAELoss
+ CosineSimilarityLoss
+ CrossCorrelationLoss
+ CrossEntropyLoss
+ FocalLoss
+ LabelSmoothLoss
+ PixelReconstructionLoss
+ SeesawLoss
+ SwAVLoss
+
+.. module:: mmpretrain.models.peft
+
+PEFT
+------------------
+
+.. autosummary::
+ :toctree: generated
+ :nosignatures:
+
+ LoRAModel
+
+.. module:: mmpretrain.models.utils
+
+models.utils
+------------
+
+This package includes some helper functions and common components used in various networks.
+
+.. _components:
+
+Common Components
+^^^^^^^^^^^^^^^^^
+
+.. autosummary::
+ :toctree: generated
+ :nosignatures:
+
+ ConditionalPositionEncoding
+ CosineEMA
+ HybridEmbed
+ InvertedResidual
+ LayerScale
+ MultiheadAttention
+ PatchEmbed
+ PatchMerging
+ SELayer
+ ShiftWindowMSA
+ WindowMSA
+ WindowMSAV2
+
+.. _helpers:
+
+Helper Functions
+^^^^^^^^^^^^^^^^
+
+.. autosummary::
+ :toctree: generated
+ :nosignatures:
+
+ channel_shuffle
+ is_tracing
+ make_divisible
+ resize_pos_embed
+ resize_relative_position_bias_table
+ to_ntuple
diff --git a/docs/en/api/structures.rst b/docs/en/api/structures.rst
new file mode 100644
index 0000000000000000000000000000000000000000..10caa37c8e96dde2f2fa57714d68f16ec2893967
--- /dev/null
+++ b/docs/en/api/structures.rst
@@ -0,0 +1,13 @@
+.. role:: hidden
+ :class: hidden-section
+
+.. module:: mmpretrain.structures
+
+mmpretrain.structures
+===================================
+
+This package includes basic data structures.
+
+DataSample
+-------------
+.. autoclass:: DataSample
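+
+A minimal usage sketch (the label and scores below are arbitrary values):
+
+.. code:: python
+
+    import torch
+
+    from mmpretrain.structures import DataSample
+
+    data_sample = DataSample()
+    data_sample.set_gt_label(3)                 # ground-truth label
+    data_sample.set_pred_score(torch.rand(10))  # predicted scores of 10 classes
+    print(data_sample)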
diff --git a/docs/en/api/utils.rst b/docs/en/api/utils.rst
new file mode 100644
index 0000000000000000000000000000000000000000..b2b9ea91c5589b33206c2ce614e92c16a02a2179
--- /dev/null
+++ b/docs/en/api/utils.rst
@@ -0,0 +1,19 @@
+.. role:: hidden
+ :class: hidden-section
+
+.. module:: mmpretrain.utils
+
+mmpretrain.utils
+===================================
+
+This package includes some useful helper functions for development.
+
+.. autosummary::
+ :toctree: generated
+ :nosignatures:
+
+ collect_env
+ register_all_modules
+ load_json_log
+ track_on_main_process
+ get_ori_model
diff --git a/docs/en/api/visualization.rst b/docs/en/api/visualization.rst
new file mode 100644
index 0000000000000000000000000000000000000000..85742a1c487f9ceff424f35fd8e1b0e2898997a1
--- /dev/null
+++ b/docs/en/api/visualization.rst
@@ -0,0 +1,14 @@
+.. role:: hidden
+ :class: hidden-section
+
+.. module:: mmpretrain.visualization
+
+mmpretrain.visualization
+===================================
+
+This package includes visualizer and some helper functions for visualization.
+
+Visualizer
+-------------
+.. autoclass:: UniversalVisualizer
+ :members:
diff --git a/docs/en/conf.py b/docs/en/conf.py
new file mode 100644
index 0000000000000000000000000000000000000000..a5a7fefbb9fd95f46075d926a6dc525ae50a28e5
--- /dev/null
+++ b/docs/en/conf.py
@@ -0,0 +1,248 @@
+# flake8: noqa
+# Configuration file for the Sphinx documentation builder.
+#
+# This file only contains a selection of the most common options. For a full
+# list see the documentation:
+# https://www.sphinx-doc.org/en/master/usage/configuration.html
+
+# -- Path setup --------------------------------------------------------------
+
+# If extensions (or modules to document with autodoc) are in another directory,
+# add these directories to sys.path here. If the directory is relative to the
+# documentation root, use os.path.abspath to make it absolute, like shown here.
+#
+import os
+import subprocess
+import sys
+
+import pytorch_sphinx_theme
+from sphinx.builders.html import StandaloneHTMLBuilder
+
+sys.path.insert(0, os.path.abspath('../../'))
+
+# -- Project information -----------------------------------------------------
+
+project = 'MMPretrain'
+copyright = '2020, OpenMMLab'
+author = 'MMPretrain Authors'
+
+# The full version, including alpha/beta/rc tags
+version_file = '../../mmpretrain/version.py'
+
+
+def get_version():
+ with open(version_file, 'r') as f:
+ exec(compile(f.read(), version_file, 'exec'))
+ return locals()['__version__']
+
+
+release = get_version()
+
+# -- General configuration ---------------------------------------------------
+
+# Add any Sphinx extension module names here, as strings. They can be
+# extensions coming with Sphinx (named 'sphinx.ext.*') or your custom
+# ones.
+extensions = [
+ 'sphinx.ext.autodoc',
+ 'sphinx.ext.autosummary',
+ 'sphinx.ext.intersphinx',
+ 'sphinx.ext.napoleon',
+ 'sphinx.ext.viewcode',
+ 'myst_parser',
+ 'sphinx_copybutton',
+ 'sphinx_tabs.tabs',
+ 'notfound.extension',
+ 'sphinxcontrib.jquery',
+]
+
+# Add any paths that contain templates here, relative to this directory.
+templates_path = ['_templates']
+
+# The suffix(es) of source filenames.
+# You can specify multiple suffix as a list of string:
+#
+source_suffix = {
+ '.rst': 'restructuredtext',
+ '.md': 'markdown',
+}
+
+language = 'en'
+
+# The master toctree document.
+root_doc = 'index'
+
+# List of patterns, relative to source directory, that match files and
+# directories to ignore when looking for source files.
+# This pattern also affects html_static_path and html_extra_path.
+exclude_patterns = ['_build', 'Thumbs.db', '.DS_Store']
+
+# -- Options for HTML output -------------------------------------------------
+
+# The theme to use for HTML and HTML Help pages. See the documentation for
+# a list of builtin themes.
+#
+html_theme = 'pytorch_sphinx_theme'
+html_theme_path = [pytorch_sphinx_theme.get_html_theme_path()]
+
+# Theme options are theme-specific and customize the look and feel of a theme
+# further. For a list of options available for each theme, see the
+# documentation.
+# yapf: disable
+html_theme_options = {
+ 'menu': [
+ {
+ 'name': 'GitHub',
+ 'url': 'https://github.com/open-mmlab/mmpretrain'
+ },
+ {
+ 'name': 'Colab Tutorials',
+ 'children': [
+ {'name': 'Train and inference with shell commands',
+ 'url': 'https://colab.research.google.com/github/mzr1996/mmpretrain-tutorial/blob/master/1.x/MMPretrain_tools.ipynb'},
+ {'name': 'Train and inference with Python APIs',
+ 'url': 'https://colab.research.google.com/github/mzr1996/mmpretrain-tutorial/blob/master/1.x/MMPretrain_python.ipynb'},
+ ]
+ },
+ {
+ 'name': 'Version',
+ 'children': [
+ {'name': 'MMPreTrain 0.x',
+ 'url': 'https://mmpretrain.readthedocs.io/en/0.x/',
+ 'description': '0.x branch'},
+ {'name': 'MMPreTrain 1.x',
+ 'url': 'https://mmpretrain.readthedocs.io/en/latest/',
+ 'description': 'Main branch'},
+ ],
+ }
+ ],
+ # Specify the language of shared menu
+ 'menu_lang': 'en',
+ # Disable the default edit on GitHub
+ 'default_edit_on_github': False,
+}
+# yapf: enable
+
+# Add any paths that contain custom static files (such as style sheets) here,
+# relative to this directory. They are copied after the builtin static files,
+# so a file named "default.css" will overwrite the builtin "default.css".
+html_static_path = ['_static']
+html_css_files = [
+ 'https://cdn.datatables.net/v/bs4/dt-1.12.1/datatables.min.css',
+ 'css/readthedocs.css'
+]
+html_js_files = [
+ 'https://cdn.datatables.net/v/bs4/dt-1.12.1/datatables.min.js',
+ 'js/custom.js'
+]
+
+# -- Options for HTMLHelp output ---------------------------------------------
+
+# Output file base name for HTML help builder.
+htmlhelp_basename = 'mmpretraindoc'
+
+# -- Options for LaTeX output ------------------------------------------------
+
+latex_elements = {
+ # The paper size ('letterpaper' or 'a4paper').
+ #
+ # 'papersize': 'letterpaper',
+
+ # The font size ('10pt', '11pt' or '12pt').
+ #
+ # 'pointsize': '10pt',
+
+ # Additional stuff for the LaTeX preamble.
+ #
+ # 'preamble': '',
+}
+
+# Grouping the document tree into LaTeX files. List of tuples
+# (source start file, target name, title,
+# author, documentclass [howto, manual, or own class]).
+latex_documents = [
+ (root_doc, 'mmpretrain.tex', 'MMPretrain Documentation', author, 'manual'),
+]
+
+# -- Options for manual page output ------------------------------------------
+
+# One entry per manual page. List of tuples
+# (source start file, name, description, authors, manual section).
+man_pages = [(root_doc, 'mmpretrain', 'MMPretrain Documentation', [author], 1)]
+
+# -- Options for Texinfo output ----------------------------------------------
+
+# Grouping the document tree into Texinfo files. List of tuples
+# (source start file, target name, title, author,
+# dir menu entry, description, category)
+texinfo_documents = [
+ (root_doc, 'mmpretrain', 'MMPretrain Documentation', author, 'mmpretrain',
+ 'OpenMMLab pre-training toolbox and benchmark.', 'Miscellaneous'),
+]
+
+# -- Options for Epub output -------------------------------------------------
+
+# Bibliographic Dublin Core info.
+epub_title = project
+
+# The unique identifier of the text. This can be a ISBN number
+# or the project homepage.
+#
+# epub_identifier = ''
+
+# A unique identification for the text.
+#
+# epub_uid = ''
+
+# A list of files that should not be packed into the epub file.
+epub_exclude_files = ['search.html']
+
+# set priority when building html
+StandaloneHTMLBuilder.supported_image_types = [
+ 'image/svg+xml', 'image/gif', 'image/png', 'image/jpeg'
+]
+
+# -- Extension configuration -------------------------------------------------
+# Ignore >>> when copying code
+copybutton_prompt_text = r'>>> |\.\.\. '
+copybutton_prompt_is_regexp = True
+
+# Auto-generated header anchors
+myst_heading_anchors = 3
+# Enable "colon_fence" extension of myst.
+myst_enable_extensions = ['colon_fence', 'dollarmath']
+
+# Configuration for intersphinx
+intersphinx_mapping = {
+ 'python': ('https://docs.python.org/3', None),
+ 'numpy': ('https://numpy.org/doc/stable', None),
+ 'torch': ('https://pytorch.org/docs/stable/', None),
+ 'mmcv': ('https://mmcv.readthedocs.io/en/2.x/', None),
+ 'mmengine': ('https://mmengine.readthedocs.io/en/latest/', None),
+ 'transformers':
+ ('https://huggingface.co/docs/transformers/main/en/', None),
+}
+napoleon_custom_sections = [
+ # Custom sections for data elements.
+ ('Meta fields', 'params_style'),
+ ('Data fields', 'params_style'),
+]
+
+# Disable docstring inheritance
+autodoc_inherit_docstrings = False
+# Mock some imports during generate API docs.
+autodoc_mock_imports = ['rich', 'attr', 'einops', 'mat4py']
+# Disable displaying type annotations, these can be very verbose
+autodoc_typehints = 'none'
+
+# The not found page
+notfound_template = '404.html'
+
+
+def builder_inited_handler(app):
+ if subprocess.run(['./stat.py']).returncode != 0:
+ raise RuntimeError('Failed to run the script `stat.py`.')
+
+
+def setup(app):
+ app.connect('builder-inited', builder_inited_handler)
diff --git a/docs/en/device/npu.md b/docs/en/device/npu.md
new file mode 100644
index 0000000000000000000000000000000000000000..d450029f7211bf10e00568bf00d26567f15b59a0
--- /dev/null
+++ b/docs/en/device/npu.md
@@ -0,0 +1,47 @@
+# NPU (HUAWEI Ascend)
+
+## Usage
+
+### General Usage
+
+Please refer to the [building documentation of MMCV](https://mmcv.readthedocs.io/en/latest/get_started/build.html#build-mmcv-full-on-ascend-npu-machine) to install MMCV and [MMEngine](https://mmengine.readthedocs.io/en/latest/get_started/installation.html#build-from-source) on NPU devices.
+
+Here we use 8 NPUs to train the model with the following command:
+
+```shell
+bash ./tools/dist_train.sh configs/resnet/resnet50_8xb32_in1k.py 8
+```
+
+Also, you can use only one NPU to train the model with the following command:
+
+```shell
+python ./tools/train.py configs/resnet/resnet50_8xb32_in1k.py
+```
+
+## Model Results
+
+| Model | Top-1 (%) | Top-5 (%) | Config | Download |
+| :---------------------------------------------------------: | :-------: | :-------: | :----------------------------------------------------------: | :-------------------------------------------------------------: |
+| [ResNet-50](https://github.com/open-mmlab/mmclassification/blob/1.x/configs/resnet/README.md) | 76.40 | 93.21 | [config](https://github.com/open-mmlab/mmclassification/blob/1.x/configs/resnet/resnet50_8xb32_in1k.py) | [log](https://download.openmmlab.com/mmclassification/v1/device/npu/resnet50_8xb32_in1k.log) |
+| [ResNetXt-32x4d-50](https://github.com/open-mmlab/mmclassification/blob/1.x/configs/resnext/README.md) | 77.48 | 93.75 | [config](https://github.com/open-mmlab/mmclassification/blob/1.x/configs/resnext/resnext50-32x4d_8xb32_in1k.py) | [log](https://download.openmmlab.com/mmclassification/v1/device/npu/resnext50-32x4d_8xb32_in1k.log) |
+| [HRNet-W18](https://github.com/open-mmlab/mmclassification/blob/master/configs/hrnet/README.md) | 77.06 | 93.57 | [config](https://github.com/open-mmlab/mmclassification/blob/1.x/configs/hrnet/hrnet-w18_4xb32_in1k.py) | [log](https://download.openmmlab.com/mmclassification/v1/device/npu/hrnet-w18_4xb32_in1k.log) |
+| [ResNetV1D-152](https://github.com/open-mmlab/mmclassification/blob/1.x/configs/resnet/README.md) | 79.41 | 94.48 | [config](https://github.com/open-mmlab/mmclassification/blob/1.x/configs/resnet/resnetv1d152_8xb32_in1k.py) | [log](https://download.openmmlab.com/mmclassification/v1/device/npu/resnetv1d152_8xb32_in1k.log) |
+| [SE-ResNet-50](https://github.com/open-mmlab/mmclassification/blob/1.x/configs/seresnet/README.md) | 77.65 | 93.74 | [config](https://github.com/open-mmlab/mmclassification/blob/1.x/configs/seresnet/seresnet50_8xb32_in1k.py) | [log](https://download.openmmlab.com/mmclassification/v1/device/npu/seresnet50_8xb32_in1k.log) |
+| [ShuffleNetV2 1.0x](https://github.com/open-mmlab/mmclassification/blob/1.x/configs/shufflenet_v2/README.md) | 69.52 | 88.79 | [config](https://github.com/open-mmlab/mmclassification/blob/1.x/configs/shufflenet_v2/shufflenet-v2-1x_16xb64_in1k.py) | [log](https://download.openmmlab.com/mmclassification/v1/device/npu/shufflenet-v2-1x_16xb64_in1k.log) |
+| [MobileNetV2](https://github.com/open-mmlab/mmclassification/tree/1.x/configs/mobilenet_v2) | 71.74 | 90.28 | [config](https://github.com/open-mmlab/mmclassification/blob/1.x/configs/mobilenet_v2/mobilenet-v2_8xb32_in1k.py) | [log](https://download.openmmlab.com/mmclassification/v1/device/npu/mobilenet-v2_8xb32_in1k.log) |
+| [MobileNetV3-Small](https://github.com/open-mmlab/mmclassification/blob/1.x/configs/mobilenet_v3/README.md) | 67.09 | 87.17 | [config](https://github.com/open-mmlab/mmclassification/blob/1.x/configs/mobilenet_v3/mobilenet-v3-small_8xb128_in1k.py) | [log](https://download.openmmlab.com/mmclassification/v1/device/npu/mobilenet-v3-small.log) |
+| [\*CSPResNeXt50](https://github.com/open-mmlab/mmclassification/blob/1.x/configs/cspnet/README.md) | 77.25 | 93.46 | [config](https://github.com/open-mmlab/mmclassification/blob/1.x/configs/cspnet/cspresnext50_8xb32_in1k.py) | [log](https://download.openmmlab.com/mmclassification/v1/device/npu/cspresnext50_8xb32_in1k.log) |
+| [\*EfficientNet-B4](https://github.com/open-mmlab/mmclassification/blob/1.x/configs/efficientnet/README.md) | 75.73 | 92.91 | [config](https://github.com/open-mmlab/mmclassification/blob/1.x/configs/efficientnet/efficientnet-b4_8xb32_in1k.py) | [log](https://download.openmmlab.com/mmclassification/v1/device/npu/efficientnet-b4_8xb32_in1k.log) |
+| [\*\*DenseNet121](https://github.com/open-mmlab/mmclassification/blob/1.x/configs/densenet/README.md) | 72.53 | 90.85 | [config](https://github.com/open-mmlab/mmclassification/blob/1.x/configs/densenet/densenet121_4xb256_in1k.py) | [log](https://download.openmmlab.com/mmclassification/v1/device/npu/densenet121_4xb256_in1k.log) |
+
+**Notes:**
+
+- If not specially marked, the results on the NPU are almost the same as the results on the GPU with FP32.
+- (\*) The training results of these models are lower than the results in the corresponding model README, mainly
+  because the README results were evaluated directly with the pre-trained timm weights, while the results here were
+  retrained with mmcls according to the config. The GPU training results of the same config are consistent with the
+  NPU results.
+- (\*\*) The accuracy of this model is slightly lower because the config is a 4-card config while we ran it with 8 cards; users
+  can adjust the hyperparameters to get better accuracy.
+
+**All above models are provided by Huawei Ascend group.**
diff --git a/docs/en/docutils.conf b/docs/en/docutils.conf
new file mode 100644
index 0000000000000000000000000000000000000000..0c00c84688701117f231fd0c8ec295fb747b7d8f
--- /dev/null
+++ b/docs/en/docutils.conf
@@ -0,0 +1,2 @@
+[html writers]
+table_style: colwidths-auto
diff --git a/docs/en/get_started.md b/docs/en/get_started.md
new file mode 100644
index 0000000000000000000000000000000000000000..5d33ac00969a0701fbd067b9ad2321303c04a49d
--- /dev/null
+++ b/docs/en/get_started.md
@@ -0,0 +1,164 @@
+# Prerequisites
+
+In this section we demonstrate how to prepare an environment with PyTorch.
+
+MMPretrain works on Linux, Windows and macOS. It requires Python 3.7+, CUDA 10.2+ and PyTorch 1.8+.
+
+```{note}
+If you are experienced with PyTorch and have already installed it, just skip this part and jump to the [next section](#installation). Otherwise, you can follow these steps for the preparation.
+```
+
+**Step 1.** Download and install Miniconda from the [official website](https://docs.conda.io/en/latest/miniconda.html).
+
+**Step 2.** Create a conda environment and activate it.
+
+```shell
+conda create --name openmmlab python=3.8 -y
+conda activate openmmlab
+```
+
+**Step 3.** Install PyTorch following [official instructions](https://pytorch.org/get-started/locally/), e.g.
+
+On GPU platforms:
+
+```shell
+conda install pytorch torchvision -c pytorch
+```
+
+```{warning}
+This command will automatically install the latest version PyTorch and cudatoolkit, please check whether they match your environment.
+```
+
+On CPU platforms:
+
+```shell
+conda install pytorch torchvision cpuonly -c pytorch
+```
+
+# Installation
+
+## Best Practices
+
+According to your needs, we support two install modes:
+
+- [Install from source (Recommended)](#install-from-source): You want to develop your own network or new features based on MMPretrain framework. For example, adding new datasets or new backbones. And you can use all tools we provided.
+- [Install as a Python package](#install-as-a-python-package): You just want to call MMPretrain's APIs or import MMPretrain's modules in your project.
+
+### Install from source
+
+In this case, install mmpretrain from source:
+
+```shell
+git clone https://github.com/open-mmlab/mmpretrain.git
+cd mmpretrain
+pip install -U openmim && mim install -e .
+```
+
+```{note}
+`"-e"` means installing a project in editable mode, thus any local modifications made to the code will take effect without reinstallation.
+```
+
+### Install as a Python package
+
+Just install with mim.
+
+```shell
+pip install -U openmim && mim install "mmpretrain>=1.0.0rc8"
+```
+
+```{note}
+`mim` is a light-weight command-line tool to set up an appropriate environment for OpenMMLab repositories according to the PyTorch and CUDA versions. It also has some useful functions for deep-learning experiments.
+```
+
+## Install multi-modality support (Optional)
+
+The multi-modality models in MMPretrain require extra dependencies. To install these dependencies, you
+can add `[multimodal]` during the installation. For example:
+
+```shell
+# Install from source
+mim install -e ".[multimodal]"
+
+# Install as a Python package
+mim install "mmpretrain[multimodal]>=1.0.0rc8"
+```
+
+## Verify the installation
+
+To verify whether MMPretrain is installed correctly, we provide some sample code to run an inference demo.
+
+Option (a). If you install mmpretrain from the source, just run the following command:
+
+```shell
+python demo/image_demo.py demo/demo.JPEG resnet18_8xb32_in1k --device cpu
+```
+
+You will see the output result dict including `pred_label`, `pred_score` and `pred_class` in your terminal.
+
+Option (b). If you install mmpretrain as a Python package, open your Python interpreter and copy & paste the following code.
+
+```python
+from mmpretrain import get_model, inference_model
+
+model = get_model('resnet18_8xb32_in1k', device='cpu') # or device='cuda:0'
+inference_model(model, 'demo/demo.JPEG')
+```
+
+You will see a dict printed, including the predicted label, score and category name.
+
+```{note}
+The `resnet18_8xb32_in1k` is the model name, and you can use [`mmpretrain.list_models`](mmpretrain.apis.list_models) to
+explore all models, or search them on the [Model Zoo Summary](./modelzoo_statistics.md)
+```
+
+## Customize Installation
+
+### CUDA versions
+
+When installing PyTorch, you need to specify the version of CUDA. If you are
+not clear on which to choose, follow our recommendations:
+
+- For Ampere-based NVIDIA GPUs, such as GeForce 30 series and NVIDIA A100, CUDA 11 is a must.
+- For older NVIDIA GPUs, CUDA 11 is backward compatible, but CUDA 10.2 offers better compatibility and is more lightweight.
+
+Please make sure the GPU driver satisfies the minimum version requirements. See [this table](https://docs.nvidia.com/cuda/cuda-toolkit-release-notes/index.html#cuda-major-component-versions__table-cuda-toolkit-driver-versions) for more information.
+
+```{note}
+Installing CUDA runtime libraries is enough if you follow our best practices,
+because no CUDA code will be compiled locally. However if you hope to compile
+MMCV from source or develop other CUDA operators, you need to install the
+complete CUDA toolkit from NVIDIA's [website](https://developer.nvidia.com/cuda-downloads),
+and its version should match the CUDA version of PyTorch. i.e., the specified
+version of cudatoolkit in `conda install` command.
+```
+
+### Install on CPU-only platforms
+
+MMPretrain can be built for a CPU-only environment. In CPU mode, you can train, test or run inference with a model.
+
+### Install on Google Colab
+
+See [the Colab tutorial](https://colab.research.google.com/github/mzr1996/mmclassification-tutorial/blob/master/1.x/MMClassification_tools.ipynb).
+
+### Using MMPretrain with Docker
+
+We provide a [Dockerfile](https://github.com/open-mmlab/mmpretrain/blob/main/docker/Dockerfile)
+to build an image. Ensure that your [docker version](https://docs.docker.com/engine/install/) >=19.03.
+
+```shell
+# build an image with PyTorch 1.12.1, CUDA 11.3
+# If you prefer other versions, just modify the Dockerfile
+docker build -t mmpretrain docker/
+```
+
+Run it with
+
+```shell
+docker run --gpus all --shm-size=8g -it -v {DATA_DIR}:/mmpretrain/data mmpretrain
+```
+
+## Troubleshooting
+
+If you have some issues during the installation, please first view the [FAQ](./notes/faq.md) page.
+You may [open an issue](https://github.com/open-mmlab/mmpretrain/issues/new/choose)
+on GitHub if no solution is found.
diff --git a/docs/en/index.rst b/docs/en/index.rst
new file mode 100644
index 0000000000000000000000000000000000000000..d16a32d603d018eb209e9dca546b5e200c0fba25
--- /dev/null
+++ b/docs/en/index.rst
@@ -0,0 +1,157 @@
+Welcome to MMPretrain's documentation!
+============================================
+
+MMPretrain is a newly upgraded open-source framework for pre-training.
+It has set out to provide multiple powerful pre-trained backbones and
+support different pre-training strategies. MMPretrain originated from the
+famous open-source projects
+`MMClassification `_
+and `MMSelfSup `_, and is developed
+with many exciting new features. The pre-training stage is currently essential for
+vision recognition. With rich and strong pre-trained models,
+we are able to improve various downstream vision tasks.
+
+Our primary objective for the codebase is to become an easily accessible and
+user-friendly library and to streamline research and engineering. We
+detail the properties and design of MMPretrain across different sections.
+
+Hands-on Roadmap of MMPretrain
+-------------------------------
+
+To help users quickly utilize MMPretrain, we recommend following the hands-on
+roadmap we have created for the library:
+
+ - For users who want to try MMPretrain, we suggest reading the GetStarted_
+ section for the environment setup.
+
+ - For basic usage, we refer users to UserGuides_ for utilizing various
+ algorithms to obtain the pre-trained models and evaluate their performance
+ in downstream tasks.
+
+ - For those who wish to customize their own algorithms, we provide
+ AdvancedGuides_ that include hints and rules for modifying code.
+
+ - To find your desired pre-trained models, users could check the ModelZoo_,
+ which features a summary of various backbones and pre-training methods and
+   introduction of different algorithms.
+
+ - Additionally, we provide Analysis_ and Visualization_ tools to help
+ diagnose algorithms.
+
+ - Besides, if you have any other questions or concerns, please refer to the
+ Notes_ section for potential answers.
+
+We always welcome *PRs* and *Issues* for the betterment of MMPretrain.
+
+.. _GetStarted:
+.. toctree::
+ :maxdepth: 1
+ :caption: Get Started
+
+ get_started.md
+
+.. _UserGuides:
+.. toctree::
+ :maxdepth: 1
+ :caption: User Guides
+
+ user_guides/config.md
+ user_guides/dataset_prepare.md
+ user_guides/inference.md
+ user_guides/train.md
+ user_guides/test.md
+ user_guides/downstream.md
+
+.. _AdvancedGuides:
+.. toctree::
+ :maxdepth: 1
+ :caption: Advanced Guides
+
+ advanced_guides/datasets.md
+ advanced_guides/pipeline.md
+ advanced_guides/modules.md
+ advanced_guides/schedule.md
+ advanced_guides/runtime.md
+ advanced_guides/evaluation.md
+ advanced_guides/convention.md
+
+.. _ModelZoo:
+.. toctree::
+ :maxdepth: 1
+ :caption: Model Zoo
+ :glob:
+
+ modelzoo_statistics.md
+ papers/*
+
+.. _Visualization:
+.. toctree::
+ :maxdepth: 1
+ :caption: Visualization
+
+ useful_tools/dataset_visualization.md
+ useful_tools/scheduler_visualization.md
+ useful_tools/cam_visualization.md
+ useful_tools/t-sne_visualization.md
+
+.. _Analysis:
+.. toctree::
+ :maxdepth: 1
+ :caption: Analysis Tools
+
+ useful_tools/print_config.md
+ useful_tools/verify_dataset.md
+ useful_tools/log_result_analysis.md
+ useful_tools/complexity_analysis.md
+ useful_tools/confusion_matrix.md
+ useful_tools/shape_bias.md
+
+.. toctree::
+ :maxdepth: 1
+ :caption: Deployment
+
+ useful_tools/model_serving.md
+
+.. toctree::
+ :maxdepth: 1
+ :caption: Migration
+
+ migration.md
+
+.. toctree::
+ :maxdepth: 1
+ :caption: API Reference
+
+ mmpretrain.apis
+ mmpretrain.engine
+ mmpretrain.datasets
+ Data Process
+ mmpretrain.models
+ mmpretrain.structures
+ mmpretrain.visualization
+ mmpretrain.evaluation
+ mmpretrain.utils
+
+.. _Notes:
+.. toctree::
+ :maxdepth: 1
+ :caption: Notes
+
+ notes/contribution_guide.md
+ notes/projects.md
+ notes/changelog.md
+ notes/faq.md
+ notes/pretrain_custom_dataset.md
+ notes/finetune_custom_dataset.md
+
+.. toctree::
+ :maxdepth: 1
+ :caption: Device Support
+
+ device/npu.md
+
+Indices and tables
+==================
+
+* :ref:`genindex`
+* :ref:`search`
diff --git a/docs/en/migration.md b/docs/en/migration.md
new file mode 100644
index 0000000000000000000000000000000000000000..bdebdf6f5a9b454f94b5c66688f33d429544669e
--- /dev/null
+++ b/docs/en/migration.md
@@ -0,0 +1,772 @@
+# Migration
+
+We introduce some modifications in MMPretrain 1.x, and some of them are BC-breaking. To migrate your projects from **MMClassification 0.x** or **MMSelfSup 0.x** smoothly, please read this tutorial.
+
+- [Migration](#migration)
+ - [New dependencies](#new-dependencies)
+- [General change of config](#general-change-of-config)
+ - [Schedule settings](#schedule-settings)
+ - [Runtime settings](#runtime-settings)
+ - [Other changes](#other-changes)
+- [Migration from MMClassification 0.x](#migration-from-mmclassification-0x)
+ - [Config files](#config-files)
+ - [Model settings](#model-settings)
+ - [Data settings](#data-settings)
+ - [Packages](#packages)
+ - [`mmpretrain.apis`](#mmpretrainapis)
+ - [`mmpretrain.core`](#mmpretraincore)
+ - [`mmpretrain.datasets`](#mmpretraindatasets)
+ - [`mmpretrain.models`](#mmpretrainmodels)
+ - [`mmpretrain.utils`](#mmpretrainutils)
+- [Migration from MMSelfSup 0.x](#migration-from-mmselfsup-0x)
+ - [Config](#config)
+ - [Dataset settings](#dataset-settings)
+ - [Model settings](#model-settings-1)
+ - [Package](#package)
+
+## New dependencies
+
+```{warning}
+MMPretrain 1.x has new package dependencies, and a new environment should be created for MMPretrain 1.x even if you already have a working MMClassification 0.x or MMSelfSup 0.x environment. Please refer to the [installation tutorial](./get_started.md) for the required package installation or install the packages manually.
+```
+
+1. [MMEngine](https://github.com/open-mmlab/mmengine): MMEngine is the core of the OpenMMLab 2.0 architecture,
+   and we have split many components unrelated to computer vision from MMCV to MMEngine.
+2. [MMCV](https://github.com/open-mmlab/mmcv): The computer vision package of OpenMMLab. This is not a new
+ dependency, but it should be upgraded to version `2.0.0rc1` or above.
+3. [rich](https://github.com/Textualize/rich): A terminal formatting package, and we use it to enhance some
+ outputs in the terminal.
+4. [einops](https://github.com/arogozhnikov/einops): Operators for Einstein notations.
+
+# General change of config
+
+In this section, we introduce the general differences between the old versions (**MMClassification 0.x** or **MMSelfSup 0.x**) and **MMPretrain 1.x**.
+
+## Schedule settings
+
+| MMCls or MMSelfSup 0.x | MMPretrain 1.x | Remark |
+| ---------------------- | --------------- | ------------------------------------------------------------------------------------------------------------------------------- |
+| optimizer_config | / | It has been **removed**. |
+| / | optim_wrapper | The `optim_wrapper` provides a common interface for updating parameters. |
+| lr_config | param_scheduler | The `param_scheduler` is a list to set learning rate or other parameters, which is more flexible. |
+| runner                 | train_cfg       | The loop setting (`EpochBasedTrainLoop`, `IterBasedTrainLoop`) in `train_cfg` controls the workflow of the algorithm training.    |
+
+Changes in **`optimizer`** and **`optimizer_config`**:
+
+- Now we use `optim_wrapper` field to specify all configurations related to optimization process. The
+ `optimizer` has become a subfield of `optim_wrapper`.
+- The `paramwise_cfg` field is also a subfield of `optim_wrapper`, instead of `optimizer`.
+- The `optimizer_config` field has been removed, and all configurations have been moved to `optim_wrapper`.
+- The `grad_clip` field has been renamed to `clip_grad`.
+
+**Original**
+
+```python
+optimizer = dict(
+ type='AdamW',
+ lr=0.0015,
+ weight_decay=0.3,
+ paramwise_cfg = dict(
+ norm_decay_mult=0.0,
+ bias_decay_mult=0.0,
+ ))
+
+optimizer_config = dict(grad_clip=dict(max_norm=1.0))
+```
+
+**New**
+
+```python
+optim_wrapper = dict(
+ optimizer=dict(type='AdamW', lr=0.0015, weight_decay=0.3),
+ paramwise_cfg = dict(
+ norm_decay_mult=0.0,
+ bias_decay_mult=0.0,
+ ),
+ clip_grad=dict(max_norm=1.0),
+)
+```
+
+Changes in **`lr_config`**:
+
+- The `lr_config` field has been removed and replaced by the new `param_scheduler`.
+- The `warmup` related arguments have also been removed since we use a combination of schedulers to implement this
+ functionality.
+
+The new scheduler combination mechanism is highly flexible and enables the design of various learning rate/momentum curves.
+For more details, see the {external+mmengine:doc}`parameter schedulers tutorial `.
+
+**Original**
+
+```python
+lr_config = dict(
+ policy='CosineAnnealing',
+ min_lr=0,
+ warmup='linear',
+ warmup_iters=5,
+ warmup_ratio=0.01,
+ warmup_by_epoch=True)
+```
+
+**New**
+
+```python
+param_scheduler = [
+ # warmup
+ dict(
+ type='LinearLR',
+ start_factor=0.01,
+ by_epoch=True,
+ end=5,
+        # Update the learning rate every iteration.
+ convert_to_iter_based=True),
+ # main learning rate scheduler
+ dict(type='CosineAnnealingLR', by_epoch=True, begin=5),
+]
+```
+
+Changes in **`runner`**:
+
+Most of the configurations that were originally in the `runner` field have been moved to `train_cfg`, `val_cfg`, and `test_cfg`.
+These fields are used to configure the loop for training, validation, and testing.
+
+**Original**
+
+```python
+runner = dict(type='EpochBasedRunner', max_epochs=100)
+```
+
+**New**
+
+```python
+# The `val_interval` is the original `evaluation.interval`.
+train_cfg = dict(by_epoch=True, max_epochs=100, val_interval=1)
+val_cfg = dict() # Use the default validation loop.
+test_cfg = dict() # Use the default test loop.
+```
+
+In OpenMMLab 2.0, we introduced `Loop` to control the behaviors in training, validation and testing. As a result, the functionalities of `Runner` have also been changed.
+More details can be found in the {external+mmengine:doc}`MMEngine tutorials `.
+
+## Runtime settings
+
+Changes in **`checkpoint_config`** and **`log_config`**:
+
+The `checkpoint_config` has been moved to `default_hooks.checkpoint`, and `log_config` has been moved to
+`default_hooks.logger`. Additionally, many hook settings that were previously included in the script code have
+been moved to the `default_hooks` field in the runtime configuration.
+
+```python
+default_hooks = dict(
+    # record the time of every iteration.
+ timer=dict(type='IterTimerHook'),
+
+ # print log every 100 iterations.
+ logger=dict(type='LoggerHook', interval=100),
+
+ # enable the parameter scheduler.
+ param_scheduler=dict(type='ParamSchedulerHook'),
+
+ # save checkpoint per epoch, and automatically save the best checkpoint.
+ checkpoint=dict(type='CheckpointHook', interval=1, save_best='auto'),
+
+    # set sampler seed in the distributed environment.
+ sampler_seed=dict(type='DistSamplerSeedHook'),
+
+ # validation results visualization, set True to enable it.
+ visualization=dict(type='VisualizationHook', enable=False),
+)
+```
+
+In OpenMMLab 2.0, we have split the original logger into a logger and a visualizer. The logger is used to record
+information, while the visualizer is used to display the logged information in different backends such as the terminal,
+TensorBoard, and Wandb.
+
+**Original**
+
+```python
+log_config = dict(
+ interval=100,
+ hooks=[
+ dict(type='TextLoggerHook'),
+ dict(type='TensorboardLoggerHook'),
+ ])
+```
+
+**New**
+
+```python
+default_hooks = dict(
+ ...
+ logger=dict(type='LoggerHook', interval=100),
+)
+
+visualizer = dict(
+ type='UniversalVisualizer',
+ vis_backends=[dict(type='LocalVisBackend'), dict(type='TensorboardVisBackend')],
+)
+```
+
+Changes in **`load_from`** and **`resume_from`**:
+
+The `resume_from` field has been removed; we use `resume` and `load_from` together instead:
+
+- If `resume=True` and `load_from` is not None, training is resumed from the checkpoint in `load_from`.
+- If `resume=True` and `load_from` is None, the runner tries to resume from the latest checkpoint in the work directory.
+- If `resume=False` and `load_from` is not None, only the checkpoint is loaded, without resuming training.
+- If `resume=False` and `load_from` is None, neither loading nor resuming is performed.
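+
+For example, to resume an interrupted run from a specific checkpoint, the runtime config could look like the sketch below (the path is illustrative):
+
+```python
+# Load the weights from this checkpoint and also restore the training state
+# (epoch, optimizer and scheduler states) from it.
+load_from = 'work_dirs/my_exp/epoch_50.pth'
+resume = True
+```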
+
+Changes in **`dist_params`**: The `dist_params` field has become a subfield of `env_cfg` now.
+Additionally, some new configurations have been added to `env_cfg`.
+
+```python
+env_cfg = dict(
+ # whether to enable cudnn benchmark
+ cudnn_benchmark=False,
+
+ # set multi process parameters
+ mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
+
+ # set distributed parameters
+ dist_cfg=dict(backend='nccl'),
+)
+```
+
+Changes in **`workflow`**: `workflow` related functionalities are removed.
+
+New field **`visualizer`**: The visualizer is a new design in OpenMMLab 2.0 architecture. The runner uses an
+instance of the visualizer to handle result and log visualization, as well as to save to different backends.
+For more information, please refer to the {external+mmengine:doc}`MMEngine tutorial `.
+
+```python
+visualizer = dict(
+ type='UniversalVisualizer',
+ vis_backends=[
+ dict(type='LocalVisBackend'),
+ # Uncomment the below line to save the log and visualization results to TensorBoard.
+ # dict(type='TensorboardVisBackend')
+ ]
+)
+```
+
+New field **`default_scope`**: The starting point to search modules in all registries. The `default_scope` in MMPretrain is `mmpretrain`. See {external+mmengine:doc}`the registry tutorial ` for more details.
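+
+A minimal sketch of how this looks in a config file:
+
+```python
+# Modules referred to by `type='...'` in this config are looked up in the
+# registries under the `mmpretrain` scope by default.
+default_scope = 'mmpretrain'
+```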
+
+## Other changes
+
+We moved the definition of all registries in different packages to the `mmpretrain.registry` package.
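+
+For example, a hypothetical custom module is now registered through that package (a sketch; `MyBackbone` is a made-up name):
+
+```python
+from torch import nn
+
+from mmpretrain.registry import MODELS
+
+
+@MODELS.register_module()
+class MyBackbone(nn.Module):
+    """A toy backbone registered into the unified MODELS registry."""
+
+    def forward(self, x):
+        return x
+```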
+
+# Migration from MMClassification 0.x
+
+## Config files
+
+In MMPretrain 1.x, we refactored the structure of configuration files, and the original files are not usable.
+
+In this section, we will introduce all changes of the configuration files. And we assume you already have
+ideas of the [config files](./user_guides/config.md).
+
+### Model settings
+
+No changes in `model.backbone`, `model.neck` and `model.head` fields.
+
+Changes in **`model.train_cfg`**:
+
+- `BatchMixup` is renamed to [`Mixup`](mmpretrain.models.utils.batch_augments.Mixup).
+- `BatchCutMix` is renamed to [`CutMix`](mmpretrain.models.utils.batch_augments.CutMix).
+- `BatchResizeMix` is renamed to [`ResizeMix`](mmpretrain.models.utils.batch_augments.ResizeMix).
+- The `prob` argument is removed from all augments settings. You can use the `probs` field in `train_cfg` to
+  specify the probability of each augmentation. If the `probs` field is not set, one augmentation is chosen
+  randomly with equal probability (see the sketch after the comparison below).
+
+**Original**
+
+```python
+model = dict(
+ ...
+ train_cfg=dict(augments=[
+ dict(type='BatchMixup', alpha=0.8, num_classes=1000, prob=0.5),
+ dict(type='BatchCutMix', alpha=1.0, num_classes=1000, prob=0.5)
+    ]),
+)
+```
+
+**New**
+
+```python
+model = dict(
+ ...
+    train_cfg=dict(augments=[
+        dict(type='Mixup', alpha=0.8),
+        dict(type='CutMix', alpha=1.0),
+    ]),
+)
+```
+
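+If you want to keep explicit probabilities for each augmentation, a rough sketch in the new style could be (the values are illustrative):
+
+```python
+model = dict(
+    type='ImageClassifier',
+    # backbone, neck and head are unchanged and omitted here.
+    train_cfg=dict(
+        augments=[dict(type='Mixup', alpha=0.8), dict(type='CutMix', alpha=1.0)],
+        # Apply Mixup 30% of the time and CutMix 70% of the time.
+        probs=[0.3, 0.7],
+    ),
+)
+```
+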
+### Data settings
+
+Changes in **`data`**:
+
+- The original `data` field is split into `train_dataloader`, `val_dataloader` and
+  `test_dataloader`. This allows us to configure them in a fine-grained manner. For example,
+  you can specify different samplers and batch sizes during training and testing.
+- The `samples_per_gpu` is renamed to `batch_size`.
+- The `workers_per_gpu` is renamed to `num_workers`.
+
+**Original**
+
+```python
+data = dict(
+ samples_per_gpu=32,
+ workers_per_gpu=2,
+ train=dict(...),
+ val=dict(...),
+ test=dict(...),
+)
+```
+
+**New**
+
+```python
+train_dataloader = dict(
+ batch_size=32,
+ num_workers=2,
+ dataset=dict(...),
+ sampler=dict(type='DefaultSampler', shuffle=True) # necessary
+)
+
+val_dataloader = dict(
+ batch_size=32,
+ num_workers=2,
+ dataset=dict(...),
+ sampler=dict(type='DefaultSampler', shuffle=False) # necessary
+)
+
+test_dataloader = val_dataloader
+```
+
+Changes in **`pipeline`**:
+
+- The original formatting transforms **`ToTensor`**, **`ImageToTensor`** and **`Collect`** are combined as [`PackInputs`](mmpretrain.datasets.transforms.PackInputs).
+- We no longer recommend performing **`Normalize`** in the dataset pipeline. Please remove it from pipelines and set it in the `data_preprocessor` field instead.
+- The argument `flip_prob` in [**`RandomFlip`**](mmcv.transforms.RandomFlip) is renamed to `prob`.
+- The argument `size` in [**`RandomCrop`**](mmpretrain.datasets.transforms.RandomCrop) is renamed to `crop_size`.
+- The argument `size` in [**`RandomResizedCrop`**](mmpretrain.datasets.transforms.RandomResizedCrop) is renamed to `scale`.
+- The argument `size` in [**`Resize`**](mmcv.transforms.Resize) is renamed to `scale`. `Resize` no longer supports sizes like `(256, -1)`; please use [`ResizeEdge`](mmpretrain.datasets.transforms.ResizeEdge) instead.
+- The argument `policies` in [**`AutoAugment`**](mmpretrain.datasets.transforms.AutoAugment) and [**`RandAugment`**](mmpretrain.datasets.transforms.RandAugment) supports using string to specify preset policies. `AutoAugment` supports "imagenet" and `RandAugment` supports "timm_increasing".
+- **`RandomResizedCrop`** and **`CenterCrop`** no longer support `efficientnet_style`; please use [`EfficientNetRandomCrop`](mmpretrain.datasets.transforms.EfficientNetRandomCrop) and [`EfficientNetCenterCrop`](mmpretrain.datasets.transforms.EfficientNetCenterCrop) instead.
+
+```{note}
+We have moved some of the data transform work, such as normalization, to the data preprocessor; see
+[the documentation](mmpretrain.models.utils.data_preprocessor) for more details.
+```
+
+**Original**
+
+```python
+img_norm_cfg = dict(
+ mean=[123.675, 116.28, 103.53], std=[58.395, 57.12, 57.375], to_rgb=True)
+
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='RandomResizedCrop', size=224),
+ dict(type='RandomFlip', flip_prob=0.5, direction='horizontal'),
+ dict(type='Normalize', **img_norm_cfg),
+ dict(type='ImageToTensor', keys=['img']),
+ dict(type='ToTensor', keys=['gt_label']),
+ dict(type='Collect', keys=['img', 'gt_label'])
+]
+```
+
+**New**
+
+```python
+data_preprocessor = dict(
+ mean=[123.675, 116.28, 103.53], std=[58.395, 57.12, 57.375], to_rgb=True)
+
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='RandomResizedCrop', scale=224),
+ dict(type='RandomFlip', prob=0.5, direction='horizontal'),
+ dict(type='PackInputs'),
+]
+```
+
+Changes in **`evaluation`**:
+
+- The **`evaluation`** field is split into `val_evaluator` and `test_evaluator`, and it no longer supports the `interval` and `save_best` arguments.
+  The `interval` is moved to `train_cfg.val_interval`, see [the schedule settings](./user_guides/config.md#schedule-settings), and the `save_best`
+  is moved to `default_hooks.checkpoint.save_best`, see [the runtime settings](./user_guides/config.md#runtime-settings).
+- The 'accuracy' metric is renamed to [`Accuracy`](mmpretrain.evaluation.Accuracy).
+- The 'precision', 'recall', 'f1-score' and 'support' metrics are combined into [`SingleLabelMetric`](mmpretrain.evaluation.SingleLabelMetric); use the `items` argument to specify which metrics to calculate.
+- The 'mAP' is renamed to [`AveragePrecision`](mmpretrain.evaluation.AveragePrecision).
+- The 'CP', 'CR', 'CF1', 'OP', 'OR' and 'OF1' metrics are combined into [`MultiLabelMetric`](mmpretrain.evaluation.MultiLabelMetric); use the `items` and `average` arguments to specify which metrics to calculate.
+
+**Original**
+
+```python
+evaluation = dict(
+ interval=1,
+ metric='accuracy',
+ metric_options=dict(topk=(1, 5))
+)
+```
+
+**New**
+
+```python
+val_evaluator = dict(type='Accuracy', topk=(1, 5))
+test_evaluator = val_evaluator
+```
+
+**Original**
+
+```python
+evaluation = dict(
+ interval=1,
+ metric=['mAP', 'CP', 'OP', 'CR', 'OR', 'CF1', 'OF1'],
+ metric_options=dict(thr=0.5),
+)
+```
+
+**New**
+
+```python
+val_evaluator = [
+ dict(type='AveragePrecision'),
+ dict(type='MultiLabelMetric',
+ items=['precision', 'recall', 'f1-score'],
+ average='both',
+ thr=0.5),
+]
+test_evaluator = val_evaluator
+```
+
+## Packages
+
+### `mmpretrain.apis`
+
+The documentation can be found [here](mmpretrain.apis).
+
+| Function | Changes |
+| :------------------: | :----------------------------------------------------------------------------------------------------------------------------------------------- |
+| `init_model` | No changes |
+| `inference_model`    | No changes, but we recommend using [`mmpretrain.ImageClassificationInferencer`](mmpretrain.apis.ImageClassificationInferencer) instead.           |
+| `train_model` | Removed, use `runner.train` to train. |
+| `multi_gpu_test` | Removed, use `runner.test` to test. |
+| `single_gpu_test` | Removed, use `runner.test` to test. |
+| `show_result_pyplot` | Removed, use [`mmpretrain.ImageClassificationInferencer`](mmpretrain.apis.ImageClassificationInferencer) to run inference and show the result.    |
+| `set_random_seed` | Removed, use `mmengine.runner.set_random_seed`. |
+| `init_random_seed` | Removed, use `mmengine.dist.sync_random_seed`. |
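+
+As a quick reference, a minimal usage sketch of the inferencer might look like this (the model name and image path are placeholders):
+
+```python
+from mmpretrain import ImageClassificationInferencer
+
+# Build the inferencer from a model name in the model zoo
+# (a config/checkpoint pair also works).
+inferencer = ImageClassificationInferencer('resnet50_8xb32_in1k')
+# The inferencer returns a list of result dicts, one per input image.
+result = inferencer('demo/demo.JPEG')[0]
+print(result['pred_class'], result['pred_score'])
+```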
+
+### `mmpretrain.core`
+
+The `mmpretrain.core` package is renamed to [`mmpretrain.engine`](mmpretrain.engine).
+
+| Sub package | Changes |
+| :-------------: | :-------------------------------------------------------------------------------------------------------------------------------- |
+| `evaluation` | Removed, use the metrics in [`mmpretrain.evaluation`](mmpretrain.evaluation). |
+| `hook` | Moved to [`mmpretrain.engine.hooks`](mmpretrain.engine.hooks) |
+| `optimizers` | Moved to [`mmpretrain.engine.optimizers`](mmpretrain.engine.optimizers) |
+| `utils` | Removed, the distributed environment related functions can be found in the [`mmengine.dist`](api/dist) package. |
+| `visualization` | Removed, the related functionalities are implemented in [`mmengine.visualization.Visualizer`](mmengine.visualization.Visualizer). |
+
+The `MMClsWandbHook` in the `hooks` package has not been implemented yet.
+
+The `CosineAnnealingCooldownLrUpdaterHook` in the `hooks` package has been removed; this functionality is now supported by
+a combination of parameter schedulers, see [the tutorial](./advanced_guides/schedule.md).
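+
+For instance, a rough sketch of a cosine schedule followed by a constant cool-down phase (the epoch boundaries and factor are illustrative, not an official recipe):
+
+```python
+param_scheduler = [
+    # Cosine annealing for the first 95 epochs.
+    dict(type='CosineAnnealingLR', T_max=95, by_epoch=True, begin=0, end=95),
+    # Hold a small constant learning rate for the last 5 epochs as a cool-down.
+    dict(type='ConstantLR', factor=0.1, by_epoch=True, begin=95, end=100),
+]
+```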
+
+### `mmpretrain.datasets`
+
+The documentation can be found [here](mmpretrain.datasets).
+
+| Dataset class | Changes |
+| :---------------------------------------------------------------------------------------: | :-------------------------------------------------------------------------------------------------------------- |
+| [`CustomDataset`](mmpretrain.datasets.CustomDataset)                                       | Adds a `data_root` argument as the common prefix of `data_prefix` and `ann_file`, and supports loading unlabeled data. |
+| [`ImageNet`](mmpretrain.datasets.ImageNet) | Same as `CustomDataset`. |
+| [`ImageNet21k`](mmpretrain.datasets.ImageNet21k) | Same as `CustomDataset`. |
+| [`CIFAR10`](mmpretrain.datasets.CIFAR10) & [`CIFAR100`](mmpretrain.datasets.CIFAR100) | The `test_mode` argument is a required argument now. |
+| [`MNIST`](mmpretrain.datasets.MNIST) & [`FashionMNIST`](mmpretrain.datasets.FashionMNIST) | The `test_mode` argument is a required argument now. |
+| [`VOC`](mmpretrain.datasets.VOC) | Requires `data_root`, `image_set_path` and `test_mode` now. |
+| [`CUB`](mmpretrain.datasets.CUB) | Requires `data_root` and `test_mode` now. |
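+
+As a rough example of the new arguments, a `CustomDataset` could be configured like below (the directory layout and file names are hypothetical):
+
+```python
+train_dataloader = dict(
+    batch_size=32,
+    num_workers=4,
+    sampler=dict(type='DefaultSampler', shuffle=True),
+    dataset=dict(
+        type='CustomDataset',
+        data_root='data/my_dataset',  # common prefix of the paths below
+        ann_file='meta/train.txt',    # relative to data_root
+        data_prefix='train/',         # relative to data_root
+        pipeline=[...],               # your training pipeline
+    ),
+)
+```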
+
+The `mmpretrain.datasets.pipelines` is renamed to `mmpretrain.datasets.transforms`.
+
+| Transform class | Changes |
+| :-----------------------------: | :------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
+| `LoadImageFromFile` | Removed, use [`mmcv.transforms.LoadImageFromFile`](mmcv.transforms.LoadImageFromFile). |
+| `RandomFlip` | Removed, use [`mmcv.transforms.RandomFlip`](mmcv.transforms.RandomFlip). The argument `flip_prob` is renamed to `prob`. |
+| `RandomCrop` | The argument `size` is renamed to `crop_size`. |
+| `RandomResizedCrop` | The argument `size` is renamed to `scale`. The argument `scale` is renamed to `crop_ratio_range`. Won't support `efficientnet_style`, use [`EfficientNetRandomCrop`](mmpretrain.datasets.transforms.EfficientNetRandomCrop). |
+| `CenterCrop` | Removed, use [`mmcv.transforms.CenterCrop`](mmcv.transforms.CenterCrop). Won't support `efficientnet_style`, use [`EfficientNetCenterCrop`](mmpretrain.datasets.transforms.EfficientNetCenterCrop). |
+| `Resize` | Removed, use [`mmcv.transforms.Resize`](mmcv.transforms.Resize). The argument `size` is renamed to `scale`. Won't support size like `(256, -1)`, use [`ResizeEdge`](mmpretrain.datasets.transforms.ResizeEdge). |
+| `AutoAugment` & `RandAugment`   | The argument `policies` supports using a string to specify preset policies.  |
+| `Compose` | Removed, use [`mmcv.transforms.Compose`](mmcv.transforms.Compose). |
+
+### `mmpretrain.models`
+
+The documentation can be found [here](mmpretrain.models). The interfaces of all **backbones**, **necks** and **losses** are unchanged.
+
+Changes in [`ImageClassifier`](mmpretrain.models.classifiers.ImageClassifier):
+
+| Method of classifiers | Changes |
+| :-------------------: | :---------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| `extract_feat` | No changes |
+| `forward` | Now only accepts three arguments: `inputs`, `data_samples` and `mode`. See [the documentation](mmpretrain.models.classifiers.ImageClassifier.forward) for more details. |
+| `forward_train` | Replaced by `loss`. |
+| `simple_test` | Replaced by `predict`. |
+| `train_step` | The `optimizer` argument is replaced by `optim_wrapper` and it accepts [`OptimWrapper`](mmengine.optim.OptimWrapper). |
+| `val_step`            | The original `val_step` was the same as `train_step`; now it calls `predict`.    |
+| `test_step` | New method, and it's the same as `val_step`. |
+
+Changes in [heads](mmpretrain.models.heads):
+
+| Method of heads | Changes |
+| :-------------: | :----------------------------------------------------------------------------------------------------------------------------------------------------- |
+| `pre_logits` | No changes |
+| `forward_train` | Replaced by `loss`. |
+| `simple_test` | Replaced by `predict`. |
+| `loss`          | It accepts `data_samples` instead of `gt_labels` to calculate the loss. The `data_samples` argument is a list of [`DataSample`](mmpretrain.structures.DataSample).     |
+| `forward`       | New method, and it returns the output of the classification head without any post-processing such as softmax or sigmoid. |
+
+### `mmpretrain.utils`
+
+| Function | Changes |
+| :--------------------------: | :-------------------------------------------------------------------------------------------------------------- |
+| `collect_env` | No changes |
+| `get_root_logger` | Removed, use [`mmengine.logging.MMLogger.get_current_instance`](mmengine.logging.MMLogger.get_current_instance) |
+| `load_json_log` | The output format changed. |
+| `setup_multi_processes` | Removed, use [`mmengine.utils.dl_utils.set_multi_processing`](mmengine.utils.dl_utils.set_multi_processing). |
+| `wrap_non_distributed_model` | Removed, we auto wrap the model in the runner. |
+| `wrap_distributed_model` | Removed, we auto wrap the model in the runner. |
+| `auto_select_device` | Removed, we auto select the device in the runner. |
+
+# Migration from MMSelfSup 0.x
+
+## Config
+
+This section illustrates the changes to our config files in the `_base_` folder, which includes three parts:
+
+- Datasets: `configs/_base_/datasets`
+- Models: `configs/_base_/models`
+- Schedules: `configs/_base_/schedules`
+
+### Dataset settings
+
+In **MMSelfSup 0.x**, we use key `data` to summarize all information, such as `samples_per_gpu`, `train`, `val`, etc.
+
+In **MMPretrain 1.x**, we use separate `train_dataloader` and `val_dataloader` fields to summarize the corresponding information, and the key `data` has been **removed**.
+
+**Original**
+
+```python
+data = dict(
+ samples_per_gpu=32, # total 32*8(gpu)=256
+ workers_per_gpu=4,
+ train=dict(
+ type=dataset_type,
+ data_source=dict(
+ type=data_source,
+ data_prefix='data/imagenet/train',
+ ann_file='data/imagenet/meta/train.txt',
+ ),
+ num_views=[1, 1],
+ pipelines=[train_pipeline1, train_pipeline2],
+ prefetch=prefetch,
+ ),
+ val=...)
+```
+
+**New**
+
+```python
+train_dataloader = dict(
+ batch_size=32,
+ num_workers=4,
+ persistent_workers=True,
+ sampler=dict(type='DefaultSampler', shuffle=True),
+ collate_fn=dict(type='default_collate'),
+ dataset=dict(
+ type=dataset_type,
+ data_root=data_root,
+ ann_file='meta/train.txt',
+ data_prefix=dict(img_path='train/'),
+ pipeline=train_pipeline))
+val_dataloader = ...
+```
+
+Besides, we have **removed** the `data_source` key to keep the pipeline format consistent with that in other OpenMMLab projects. Please refer to [Config](user_guides/config.md) for more details.
+
+Changes in **`pipeline`**:
+
+Take MAE as an example of `pipeline`:
+
+```python
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='RandomResizedCrop',
+ scale=224,
+ crop_ratio_range=(0.2, 1.0),
+ backend='pillow',
+ interpolation='bicubic'),
+ dict(type='RandomFlip', prob=0.5),
+ dict(type='PackInputs')
+]
+```
+
+### Model settings
+
+In the config of models, there are two main differences from MMSelfSup 0.x.
+
+1. There is a new key called `data_preprocessor`, which is responsible for preprocessing the data, like normalization, channel conversion, etc. For example:
+
+```python
+# The data preprocessor can be defined at the top level of the config ...
+data_preprocessor = dict(
+ mean=[123.675, 116.28, 103.53],
+ std=[58.395, 57.12, 57.375],
+ bgr_to_rgb=True)
+model = dict(
+ type='MAE',
+    # ... or directly inside the model config.
+    data_preprocessor=dict(
+ mean=[127.5, 127.5, 127.5],
+ std=[127.5, 127.5, 127.5],
+ bgr_to_rgb=True),
+ backbone=...,
+ neck=...,
+ head=...,
+ init_cfg=...)
+```
+
+2. There is a new key `loss` in `head` in MMPretrain 1.x, to determine the loss function of the algorithm. For example:
+
+```python
+model = dict(
+ type='MAE',
+ backbone=...,
+ neck=...,
+ head=dict(
+ type='MAEPretrainHead',
+ norm_pix=True,
+ patch_size=16,
+ loss=dict(type='MAEReconstructionLoss')),
+ init_cfg=...)
+```
+
+## Package
+
+The table below records the general modification of the folders and files.
+
+| MMSelfSup 0.x | MMPretrain 1.x | Remark |
+| ------------------------ | ------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| apis | apis | The high level APIs are updated. |
+| core                     | engine              | The `core` folder has been renamed to `engine`, which includes `hooks` and `optimizers`. ([API link](mmpretrain.engine))                                        |
+| datasets                 | datasets            | The datasets are implemented for different benchmarks, such as ImageNet and Places205. ([API link](mmpretrain.datasets))                                        |
+| datasets/data_sources    | /                   | The `data_sources` has been **removed**, and the `datasets` directory is now consistent with other OpenMMLab projects.                                          |
+| datasets/pipelines | datasets/transforms | The `pipelines` folder has been renamed to `transforms`. ([API link](mmpretrain.datasets.transforms)) |
+| /                        | evaluation          | The `evaluation` folder is created for evaluation functions and classes. ([API link](mmpretrain.evaluation))                                                    |
+| models/algorithms | selfsup | The algorithms are moved to `selfsup` folder. ([API link](mmpretrain.models.selfsup)) |
+| models/backbones | selfsup | The re-implemented backbones are moved to corresponding self-supervised learning algorithm `.py` files. ([API link](mmpretrain.models.selfsup)) |
+| models/target_generators | selfsup | The target generators are moved to corresponding self-supervised learning algorithm `.py` files. ([API link](mmpretrain.models.selfsup)) |
+| / | models/losses | The `losses` folder is created to provide different loss implementations, which is from `heads`. ([API link](mmpretrain.models.losses)) |
+| / | structures | The `structures` folder is for the implementation of data structures. In MMPretrain, we implement a new data structure, `DataSample`, to pass and receive data throughout the training/val process. ([API link](mmpretrain.structures)) |
+| / | visualization | The `visualization` folder contains the visualizer, which is responsible for some visualization tasks like visualizing data augmentation. ([API link](mmpretrain.visualization)) |
diff --git a/docs/en/notes/changelog.md b/docs/en/notes/changelog.md
new file mode 100644
index 0000000000000000000000000000000000000000..499ed24f64941e6731aeaa16fe492307ef4e0e4f
--- /dev/null
+++ b/docs/en/notes/changelog.md
@@ -0,0 +1,1055 @@
+# Changelog (MMPreTrain)
+
+## v1.2.0(04/01/2024)
+
+### New Features
+
+- [Feature] Support LLaVA 1.5 ([#1853](https://github.com/open-mmlab/mmpretrain/pull/1853))
+- [Feature] Implement of RAM with a gradio interface. ([#1802](https://github.com/open-mmlab/mmpretrain/pull/1802))
+
+### Bug Fix
+
+- [Fix] Fix resize mix argument bug.
+
+## v1.1.0(12/10/2023)
+
+### New Features
+
+- [Feature] Implement of Zero-Shot CLIP Classifier ([#1737](https://github.com/open-mmlab/mmpretrain/pull/1737))
+- [Feature] Add minigpt4 gradio demo and training script. ([#1758](https://github.com/open-mmlab/mmpretrain/pull/1758))
+
+### Improvements
+
+- [Config] New Version of config Adapting MobileNet Algorithm ([#1774](https://github.com/open-mmlab/mmpretrain/pull/1774))
+- [Config] Support DINO self-supervised learning in project ([#1756](https://github.com/open-mmlab/mmpretrain/pull/1756))
+- [Config] New Version of config Adapting Swin Transformer Algorithm ([#1780](https://github.com/open-mmlab/mmpretrain/pull/1780))
+- [Enhance] Add iTPN Supports for Non-three channel image ([#1735](https://github.com/open-mmlab/mmpretrain/pull/1735))
+- [Docs] Update dataset download script from opendatalab to openXlab ([#1765](https://github.com/open-mmlab/mmpretrain/pull/1765))
+- [Docs] Update COCO-Retrieval dataset docs. ([#1806](https://github.com/open-mmlab/mmpretrain/pull/1806))
+
+### Bug Fix
+
+- Update `train.py` to be compatible with the new config.
+- Update the OFA module to be compatible with the latest HuggingFace.
+- Fix pipeline bug in ImageRetrievalInferencer.
+
+## v1.0.2(15/08/2023)
+
+### New Features
+
+- Add MFF ([#1725](https://github.com/open-mmlab/mmpretrain/pull/1725))
+- Support training of BLIP2 ([#1700](https://github.com/open-mmlab/mmpretrain/pull/1700))
+
+### Improvements
+
+- New Version of config Adapting MAE Algorithm ([#1750](https://github.com/open-mmlab/mmpretrain/pull/1750))
+- New Version of config Adapting ConvNeXt Algorithm ([#1760](https://github.com/open-mmlab/mmpretrain/pull/1760))
+- New version of config adapting BeitV2 Algorithm ([#1755](https://github.com/open-mmlab/mmpretrain/pull/1755))
+- Update `dataset_prepare.md` ([#1732](https://github.com/open-mmlab/mmpretrain/pull/1732))
+- New Version of `config` Adapting Vision Transformer Algorithm ([#1727](https://github.com/open-mmlab/mmpretrain/pull/1727))
+- Support Infographic VQA dataset and ANLS metric. ([#1667](https://github.com/open-mmlab/mmpretrain/pull/1667))
+- Support IconQA dataset. ([#1670](https://github.com/open-mmlab/mmpretrain/pull/1670))
+- Fix typo MIMHIVIT to MAEHiViT ([#1749](https://github.com/open-mmlab/mmpretrain/pull/1749))
+
+## v1.0.1(28/07/2023)
+
+### Improvements
+
+- Add init_cfg with type='pretrained' to downstream tasks ([#1717](https://github.com/open-mmlab/mmpretrain/pull/1717))
+- Set 'is_init' in some multimodal methods ([#1718](https://github.com/open-mmlab/mmpretrain/pull/1718))
+- Adapt test cases on Ascend NPU ([#1728](https://github.com/open-mmlab/mmpretrain/pull/1728))
+- Add GPU acceleration on Apple silicon Mac ([#1699](https://github.com/open-mmlab/mmpretrain/pull/1699))
+- BEiT refactor ([#1705](https://github.com/open-mmlab/mmpretrain/pull/1705))
+
+### Bug Fixes
+
+- Fix dict update in minigpt4. ([#1709](https://github.com/open-mmlab/mmpretrain/pull/1709))
+- Fix nested predict for multi-task prediction ([#1716](https://github.com/open-mmlab/mmpretrain/pull/1716))
+- Fix the issue #1711 "GaussianBlur doesn't work" ([#1722](https://github.com/open-mmlab/mmpretrain/pull/1722))
+- Correct a typo of 'target' ([#1655](https://github.com/open-mmlab/mmpretrain/pull/1655))
+- Fix freeze without cls_token in vit ([#1693](https://github.com/open-mmlab/mmpretrain/pull/1693))
+- Fix RandomCrop bug ([#1706](https://github.com/open-mmlab/mmpretrain/pull/1706))
+
+### Docs Update
+
+- Fix spelling ([#1689](https://github.com/open-mmlab/mmpretrain/pull/1689))
+
+## v1.0.0(04/07/2023)
+
+### Highlights
+
+- Support inference of more **multi-modal** algorithms, such as **LLaVA**, **MiniGPT-4**, **Otter**, etc.
+- Support around **10 multi-modal datasets**!
+- Add **iTPN**, **SparK** self-supervised learning algorithms.
+- Provide examples of [New Config](https://github.com/open-mmlab/mmpretrain/tree/main/mmpretrain/configs/) and [DeepSpeed/FSDP](https://github.com/open-mmlab/mmpretrain/tree/main/configs/mae/benchmarks/).
+
+### New Features
+
+- Transfer shape-bias tool from mmselfsup ([#1658](https://github.com/open-mmlab/mmpretrain/pull/1685))
+- Download dataset by using MIM&OpenDataLab ([#1630](https://github.com/open-mmlab/mmpretrain/pull/1630))
+- Support New Configs ([#1639](https://github.com/open-mmlab/mmpretrain/pull/1639), [#1647](https://github.com/open-mmlab/mmpretrain/pull/1647), [#1665](https://github.com/open-mmlab/mmpretrain/pull/1665))
+- Support Flickr30k Retrieval dataset ([#1625](https://github.com/open-mmlab/mmpretrain/pull/1625))
+- Support SparK ([#1531](https://github.com/open-mmlab/mmpretrain/pull/1531))
+- Support LLaVA ([#1652](https://github.com/open-mmlab/mmpretrain/pull/1652))
+- Support Otter ([#1651](https://github.com/open-mmlab/mmpretrain/pull/1651))
+- Support MiniGPT-4 ([#1642](https://github.com/open-mmlab/mmpretrain/pull/1642))
+- Add support for VizWiz dataset ([#1636](https://github.com/open-mmlab/mmpretrain/pull/1636))
+- Add support for vsr dataset ([#1634](https://github.com/open-mmlab/mmpretrain/pull/1634))
+- Add InternImage Classification project ([#1569](https://github.com/open-mmlab/mmpretrain/pull/1569))
+- Support OCR-VQA dataset ([#1621](https://github.com/open-mmlab/mmpretrain/pull/1621))
+- Support OK-VQA dataset ([#1615](https://github.com/open-mmlab/mmpretrain/pull/1615))
+- Support TextVQA dataset ([#1569](https://github.com/open-mmlab/mmpretrain/pull/1569))
+- Support iTPN and HiViT ([#1584](https://github.com/open-mmlab/mmpretrain/pull/1584))
+- Add retrieval mAP metric ([#1552](https://github.com/open-mmlab/mmpretrain/pull/1552))
+- Support NoCap dataset based on BLIP. ([#1582](https://github.com/open-mmlab/mmpretrain/pull/1582))
+- Add GQA dataset ([#1585](https://github.com/open-mmlab/mmpretrain/pull/1585))
+
+### Improvements
+
+- Update fsdp vit-huge and vit-large config ([#1675](https://github.com/open-mmlab/mmpretrain/pull/1675))
+- Support deepspeed with flexible runner ([#1673](https://github.com/open-mmlab/mmpretrain/pull/1673))
+- Update Otter and LLaVA docs and config. ([#1653](https://github.com/open-mmlab/mmpretrain/pull/1653))
+- Add image_only param of ScienceQA ([#1613](https://github.com/open-mmlab/mmpretrain/pull/1613))
+- Support to use "split" to specify training set/validation ([#1535](https://github.com/open-mmlab/mmpretrain/pull/1535))
+
+### Bug Fixes
+
+- Refactor \_prepare_pos_embed in ViT ([#1656](https://github.com/open-mmlab/mmpretrain/pull/1656), [#1679](https://github.com/open-mmlab/mmpretrain/pull/1679))
+- Freeze pre norm in vision transformer ([#1672](https://github.com/open-mmlab/mmpretrain/pull/1672))
+- Fix bug loading IN1k dataset ([#1641](https://github.com/open-mmlab/mmpretrain/pull/1641))
+- Fix sam bug ([#1633](https://github.com/open-mmlab/mmpretrain/pull/1633))
+- Fixed circular import error for new transform ([#1609](https://github.com/open-mmlab/mmpretrain/pull/1609))
+- Update torchvision transform wrapper ([#1595](https://github.com/open-mmlab/mmpretrain/pull/1595))
+- Set default out_type in CAM visualization ([#1586](https://github.com/open-mmlab/mmpretrain/pull/1586))
+
+### Docs Update
+
+- Fix spelling ([#1681](https://github.com/open-mmlab/mmpretrain/pull/1681))
+- Fix doc typos ([#1671](https://github.com/open-mmlab/mmpretrain/pull/1671), [#1644](https://github.com/open-mmlab/mmpretrain/pull/1644), [#1629](https://github.com/open-mmlab/mmpretrain/pull/1629))
+- Add t-SNE visualization doc ([#1555](https://github.com/open-mmlab/mmpretrain/pull/1555))
+
+## v1.0.0rc8(22/05/2023)
+
+### Highlights
+
+- Support multiple multi-modal algorithms and inferencers. You can explore these features by the [gradio demo](https://github.com/open-mmlab/mmpretrain/tree/main/projects/gradio_demo)!
+- Add EVA-02, Dino-V2, ViT-SAM and GLIP backbones.
+- Register torchvision transforms into MMPretrain; you can now easily integrate torchvision's data augmentations in MMPretrain.
+
+### New Features
+
+- Support Chinese CLIP. ([#1576](https://github.com/open-mmlab/mmpretrain/pull/1576))
+- Add ScienceQA Metrics ([#1577](https://github.com/open-mmlab/mmpretrain/pull/1577))
+- Support multiple multi-modal algorithms and inferencers. ([#1561](https://github.com/open-mmlab/mmpretrain/pull/1561))
+- add eva02 backbone ([#1450](https://github.com/open-mmlab/mmpretrain/pull/1450))
+- Support dinov2 backbone ([#1522](https://github.com/open-mmlab/mmpretrain/pull/1522))
+- Support some downstream classification datasets. ([#1467](https://github.com/open-mmlab/mmpretrain/pull/1467))
+- Support GLIP ([#1308](https://github.com/open-mmlab/mmpretrain/pull/1308))
+- Register torchvision transforms into mmpretrain ([#1265](https://github.com/open-mmlab/mmpretrain/pull/1265))
+- Add ViT of SAM ([#1476](https://github.com/open-mmlab/mmpretrain/pull/1476))
+
+### Improvements
+
+- [Refactor] Support to freeze channel reduction and add layer decay function ([#1490](https://github.com/open-mmlab/mmpretrain/pull/1490))
+- [Refactor] Support resizing pos_embed while loading ckpt and format output ([#1488](https://github.com/open-mmlab/mmpretrain/pull/1488))
+
+### Bug Fixes
+
+- Fix scienceqa ([#1581](https://github.com/open-mmlab/mmpretrain/pull/1581))
+- Fix config of beit ([#1528](https://github.com/open-mmlab/mmpretrain/pull/1528))
+- Incorrect stage freeze on RIFormer Model ([#1573](https://github.com/open-mmlab/mmpretrain/pull/1573))
+- Fix ddp bugs caused by `out_type`. ([#1570](https://github.com/open-mmlab/mmpretrain/pull/1570))
+- Fix multi-task-head loss potential bug ([#1530](https://github.com/open-mmlab/mmpretrain/pull/1530))
+- Support bce loss without batch augmentations ([#1525](https://github.com/open-mmlab/mmpretrain/pull/1525))
+- Fix clip generator init bug ([#1518](https://github.com/open-mmlab/mmpretrain/pull/1518))
+- Fix the bug in binary cross entropy loss ([#1499](https://github.com/open-mmlab/mmpretrain/pull/1499))
+
+### Docs Update
+
+- Update PoolFormer citation to CVPR version ([#1505](https://github.com/open-mmlab/mmpretrain/pull/1505))
+- Refine Inference Doc ([#1489](https://github.com/open-mmlab/mmpretrain/pull/1489))
+- Add doc for usage of confusion matrix ([#1513](https://github.com/open-mmlab/mmpretrain/pull/1513))
+- Update MMagic link ([#1517](https://github.com/open-mmlab/mmpretrain/pull/1517))
+- Fix example_project README ([#1575](https://github.com/open-mmlab/mmpretrain/pull/1575))
+- Add NPU support page ([#1481](https://github.com/open-mmlab/mmpretrain/pull/1481))
+- train cfg: Removed old description ([#1473](https://github.com/open-mmlab/mmpretrain/pull/1473))
+- Fix typo in MultiLabelDataset docstring ([#1483](https://github.com/open-mmlab/mmpretrain/pull/1483))
+
+## v1.0.0rc7(07/04/2023)
+
+### Highlights
+
+- Integrated Self-supervised learning algorithms from **MMSelfSup**, such as **MAE**, **BEiT**, etc.
+- Support **RIFormer**, a simple but effective vision backbone by removing token mixer.
+- Support **LeViT**, **XCiT**, **ViG** and **ConvNeXt-V2** backbone.
+- Add t-SNE visualization.
+- Refactor dataset pipeline visualization.
+- Support confusion matrix calculation and plot.
+
+### New Features
+
+- Support RIFormer. ([#1453](https://github.com/open-mmlab/mmpretrain/pull/1453))
+- Support XCiT Backbone. ([#1305](https://github.com/open-mmlab/mmclassification/pull/1305))
+- Support calculate confusion matrix and plot it. ([#1287](https://github.com/open-mmlab/mmclassification/pull/1287))
+- Support RetrieverRecall metric & Add ArcFace config ([#1316](https://github.com/open-mmlab/mmclassification/pull/1316))
+- Add `ImageClassificationInferencer`. ([#1261](https://github.com/open-mmlab/mmclassification/pull/1261))
+- Support InShop Dataset (Image Retrieval). ([#1019](https://github.com/open-mmlab/mmclassification/pull/1019))
+- Support LeViT backbone. ([#1238](https://github.com/open-mmlab/mmclassification/pull/1238))
+- Support VIG Backbone. ([#1304](https://github.com/open-mmlab/mmclassification/pull/1304))
+- Support ConvNeXt-V2 backbone. ([#1294](https://github.com/open-mmlab/mmclassification/pull/1294))
+
+### Improvements
+
+- Use PyTorch official `scaled_dot_product_attention` to accelerate `MultiheadAttention`. ([#1434](https://github.com/open-mmlab/mmpretrain/pull/1434))
+- Add ln to vit avg_featmap output ([#1447](https://github.com/open-mmlab/mmpretrain/pull/1447))
+- Update analysis tools and documentations. ([#1359](https://github.com/open-mmlab/mmclassification/pull/1359))
+- Unify the `--out` and `--dump` in `tools/test.py`. ([#1307](https://github.com/open-mmlab/mmclassification/pull/1307))
+- Enable to toggle whether Gem Pooling is trainable or not. ([#1246](https://github.com/open-mmlab/mmclassification/pull/1246))
+- Update registries of mmcls. ([#1306](https://github.com/open-mmlab/mmclassification/pull/1306))
+- Add metafile fill and validation tools. ([#1297](https://github.com/open-mmlab/mmclassification/pull/1297))
+- Remove useless EfficientnetV2 config files. ([#1300](https://github.com/open-mmlab/mmclassification/pull/1300))
+
+### Bug Fixes
+
+- Fix precise bn hook ([#1466](https://github.com/open-mmlab/mmpretrain/pull/1466))
+- Fix retrieval multi gpu bug ([#1319](https://github.com/open-mmlab/mmclassification/pull/1319))
+- Fix error repvgg-deploy base config path. ([#1357](https://github.com/open-mmlab/mmclassification/pull/1357))
+- Fix bug in test tools. ([#1309](https://github.com/open-mmlab/mmclassification/pull/1309))
+
+### Docs Update
+
+- Translate some tools tutorials to Chinese. ([#1321](https://github.com/open-mmlab/mmclassification/pull/1321))
+- Add Chinese translation for runtime.md. ([#1313](https://github.com/open-mmlab/mmclassification/pull/1313))
+
+# Changelog (MMClassification)
+
+## v1.0.0rc5(30/12/2022)
+
+### Highlights
+
+- Support EVA, RevViT, EfficientnetV2, CLIP, TinyViT and MixMIM backbones.
+- Reproduce the training accuracy of ConvNeXt and RepVGG.
+- Support multi-task training and testing.
+- Support Test-time Augmentation.
+
+### New Features
+
+- [Feature] Add EfficientnetV2 Backbone. ([#1253](https://github.com/open-mmlab/mmclassification/pull/1253))
+- [Feature] Support TTA and add `--tta` in `tools/test.py`. ([#1161](https://github.com/open-mmlab/mmclassification/pull/1161))
+- [Feature] Support Multi-task. ([#1229](https://github.com/open-mmlab/mmclassification/pull/1229))
+- [Feature] Add clip backbone. ([#1258](https://github.com/open-mmlab/mmclassification/pull/1258))
+- [Feature] Add mixmim backbone with checkpoints. ([#1224](https://github.com/open-mmlab/mmclassification/pull/1224))
+- [Feature] Add TinyViT for dev-1.x. ([#1042](https://github.com/open-mmlab/mmclassification/pull/1042))
+- [Feature] Add some scripts for development. ([#1257](https://github.com/open-mmlab/mmclassification/pull/1257))
+- [Feature] Support EVA. ([#1239](https://github.com/open-mmlab/mmclassification/pull/1239))
+- [Feature] Implementation of RevViT. ([#1127](https://github.com/open-mmlab/mmclassification/pull/1127))
+
+### Improvements
+
+- [Reproduce] Reproduce RepVGG Training Accuracy. ([#1264](https://github.com/open-mmlab/mmclassification/pull/1264))
+- [Enhance] Support ConvNeXt More Weights. ([#1240](https://github.com/open-mmlab/mmclassification/pull/1240))
+- [Reproduce] Update ConvNeXt config files. ([#1256](https://github.com/open-mmlab/mmclassification/pull/1256))
+- [CI] Update CI to test PyTorch 1.13.0. ([#1260](https://github.com/open-mmlab/mmclassification/pull/1260))
+- [Project] Add ACCV workshop 1st Solution. ([#1245](https://github.com/open-mmlab/mmclassification/pull/1245))
+- [Project] Add Example project. ([#1254](https://github.com/open-mmlab/mmclassification/pull/1254))
+
+### Bug Fixes
+
+- [Fix] Fix imports in transforms. ([#1255](https://github.com/open-mmlab/mmclassification/pull/1255))
+- [Fix] Fix CAM visualization. ([#1248](https://github.com/open-mmlab/mmclassification/pull/1248))
+- [Fix] Fix the requirements and lazy register mmpretrain models. ([#1275](https://github.com/open-mmlab/mmclassification/pull/1275))
+
+## v1.0.0rc4(06/12/2022)
+
+### Highlights
+
+- Upgrade API to get pre-defined models of MMClassification. See [#1236](https://github.com/open-mmlab/mmclassification/pull/1236) for more details.
+- Refactor BEiT backbone and support v1/v2 inference. See [#1144](https://github.com/open-mmlab/mmclassification/pull/1144).
+
+### New Features
+
+- Support getting model from the name defined in the model-index file. ([#1236](https://github.com/open-mmlab/mmclassification/pull/1236))
+
+### Improvements
+
+- Support evaluate on both EMA and non-EMA models. ([#1204](https://github.com/open-mmlab/mmclassification/pull/1204))
+- Refactor BEiT backbone and support v1/v2 inference. ([#1144](https://github.com/open-mmlab/mmclassification/pull/1144))
+
+### Bug Fixes
+
+- Fix `reparameterize_model.py` doesn't save meta info. ([#1221](https://github.com/open-mmlab/mmclassification/pull/1221))
+- Fix dict update in BEiT. ([#1234](https://github.com/open-mmlab/mmclassification/pull/1234))
+
+### Docs Update
+
+- Update install tutorial. ([#1223](https://github.com/open-mmlab/mmclassification/pull/1223))
+- Update MobileNetv2 & MobileNetv3 readme. ([#1222](https://github.com/open-mmlab/mmclassification/pull/1222))
+- Add version selection in the banner. ([#1217](https://github.com/open-mmlab/mmclassification/pull/1217))
+
+## v1.0.0rc3(21/11/2022)
+
+### Highlights
+
+- Add **Switch Recipe** Hook, Now we can modify training pipeline, mixup and loss settings during training, see [#1101](https://github.com/open-mmlab/mmclassification/pull/1101).
+- Add **TIMM and HuggingFace** wrappers. Now you can train/use models in TIMM/HuggingFace directly, see [#1102](https://github.com/open-mmlab/mmclassification/pull/1102).
+- Support **retrieval tasks**, see [#1055](https://github.com/open-mmlab/mmclassification/pull/1055).
+- Reproduce **mobileone** training accuracy. See [#1191](https://github.com/open-mmlab/mmclassification/pull/1191)
+
+### New Features
+
+- Add checkpoints from EfficientNets NoisyStudent & L2. ([#1122](https://github.com/open-mmlab/mmclassification/pull/1122))
+- Migrate CSRA head to 1.x. ([#1177](https://github.com/open-mmlab/mmclassification/pull/1177))
+- Support RepLKnet backbone. ([#1129](https://github.com/open-mmlab/mmclassification/pull/1129))
+- Add Switch Recipe Hook. ([#1101](https://github.com/open-mmlab/mmclassification/pull/1101))
+- Add adan optimizer. ([#1180](https://github.com/open-mmlab/mmclassification/pull/1180))
+- Support DaViT. ([#1105](https://github.com/open-mmlab/mmclassification/pull/1105))
+- Support Activation Checkpointing for ConvNeXt. ([#1153](https://github.com/open-mmlab/mmclassification/pull/1153))
+- Add TIMM and HuggingFace wrappers to build classifiers from them directly. ([#1102](https://github.com/open-mmlab/mmclassification/pull/1102))
+- Add reduction for neck ([#978](https://github.com/open-mmlab/mmclassification/pull/978))
+- Support HorNet Backbone for dev1.x. ([#1094](https://github.com/open-mmlab/mmclassification/pull/1094))
+- Add arcface head. ([#926](https://github.com/open-mmlab/mmclassification/pull/926))
+- Add Base Retriever and Image2Image Retriever for retrieval tasks. ([#1055](https://github.com/open-mmlab/mmclassification/pull/1055))
+- Support MobileViT backbone. ([#1068](https://github.com/open-mmlab/mmclassification/pull/1068))
+
+### Improvements
+
+- [Enhance] Enhance ArcFaceClsHead. ([#1181](https://github.com/open-mmlab/mmclassification/pull/1181))
+- [Refactor] Refactor to use new fileio API in MMEngine. ([#1176](https://github.com/open-mmlab/mmclassification/pull/1176))
+- [Enhance] Reproduce mobileone training accuracy. ([#1191](https://github.com/open-mmlab/mmclassification/pull/1191))
+- [Enhance] add deleting params info in swinv2. ([#1142](https://github.com/open-mmlab/mmclassification/pull/1142))
+- [Enhance] Add more mobilenetv3 pretrains. ([#1154](https://github.com/open-mmlab/mmclassification/pull/1154))
+- [Enhancement] RepVGG for YOLOX-PAI for dev-1.x. ([#1126](https://github.com/open-mmlab/mmclassification/pull/1126))
+- [Improve] Speed up data preprocessor. ([#1064](https://github.com/open-mmlab/mmclassification/pull/1064))
+
+### Bug Fixes
+
+- Fix the torchserve. ([#1143](https://github.com/open-mmlab/mmclassification/pull/1143))
+- Fix configs due to api refactor of `num_classes`. ([#1184](https://github.com/open-mmlab/mmclassification/pull/1184))
+- Update mmpretrain2torchserve. ([#1189](https://github.com/open-mmlab/mmclassification/pull/1189))
+- Fix for `inference_model` cannot get classes information in checkpoint. ([#1093](https://github.com/open-mmlab/mmclassification/pull/1093))
+
+### Docs Update
+
+- Add not-found page extension. ([#1207](https://github.com/open-mmlab/mmclassification/pull/1207))
+- update visualization doc. ([#1160](https://github.com/open-mmlab/mmclassification/pull/1160))
+- Support sort and search the Model Summary table. ([#1100](https://github.com/open-mmlab/mmclassification/pull/1100))
+- Improve the ResNet model page. ([#1118](https://github.com/open-mmlab/mmclassification/pull/1118))
+- update the readme of convnext. ([#1156](https://github.com/open-mmlab/mmclassification/pull/1156))
+- Fix the installation docs link in README. ([#1164](https://github.com/open-mmlab/mmclassification/pull/1164))
+- Improve ViT and MobileViT model pages. ([#1155](https://github.com/open-mmlab/mmclassification/pull/1155))
+- Improve Swin Doc and Add Tabs extension. ([#1145](https://github.com/open-mmlab/mmclassification/pull/1145))
+- Add MMEval projects link in README. ([#1162](https://github.com/open-mmlab/mmclassification/pull/1162))
+- Add runtime configuration docs. ([#1128](https://github.com/open-mmlab/mmclassification/pull/1128))
+- Add custom evaluation docs ([#1130](https://github.com/open-mmlab/mmclassification/pull/1130))
+- Add custom pipeline docs. ([#1124](https://github.com/open-mmlab/mmclassification/pull/1124))
+- Add MMYOLO projects link in MMCLS1.x. ([#1117](https://github.com/open-mmlab/mmclassification/pull/1117))
+
+## v1.0.0rc2(12/10/2022)
+
+### New Features
+
+- [Feature] Support DeiT3. ([#1065](https://github.com/open-mmlab/mmclassification/pull/1065))
+
+### Improvements
+
+- [Enhance] Update `analyze_results.py` for dev-1.x. ([#1071](https://github.com/open-mmlab/mmclassification/pull/1071))
+- [Enhance] Get scores from inference api. ([#1070](https://github.com/open-mmlab/mmclassification/pull/1070))
+
+### Bug Fixes
+
+- [Fix] Update requirements. ([#1083](https://github.com/open-mmlab/mmclassification/pull/1083))
+
+### Docs Update
+
+- [Docs] Add 1x docs schedule. ([#1015](https://github.com/open-mmlab/mmclassification/pull/1015))
+
+## v1.0.0rc1(30/9/2022)
+
+### New Features
+
+- Support MViT for MMCLS 1.x ([#1023](https://github.com/open-mmlab/mmclassification/pull/1023))
+- Add ViT huge architecture. ([#1049](https://github.com/open-mmlab/mmclassification/pull/1049))
+- Support EdgeNeXt for dev-1.x. ([#1037](https://github.com/open-mmlab/mmclassification/pull/1037))
+- Support Swin Transformer V2 for MMCLS 1.x. ([#1029](https://github.com/open-mmlab/mmclassification/pull/1029))
+- Add efficientformer Backbone for MMCls 1.x. ([#1031](https://github.com/open-mmlab/mmclassification/pull/1031))
+- Add MobileOne Backbone For MMCls 1.x. ([#1030](https://github.com/open-mmlab/mmclassification/pull/1030))
+- Support BEiT Transformer layer. ([#919](https://github.com/open-mmlab/mmclassification/pull/919))
+
+### Improvements
+
+- [Refactor] Fix visualization tools. ([#1045](https://github.com/open-mmlab/mmclassification/pull/1045))
+- [Improve] Update benchmark scripts ([#1028](https://github.com/open-mmlab/mmclassification/pull/1028))
+- [Improve] Update tools to enable `pin_memory` and `persistent_workers` by default. ([#1024](https://github.com/open-mmlab/mmclassification/pull/1024))
+- [CI] Update circle-ci and github workflow. ([#1018](https://github.com/open-mmlab/mmclassification/pull/1018))
+
+### Bug Fixes
+
+- Fix verify dataset tool in 1.x. ([#1062](https://github.com/open-mmlab/mmclassification/pull/1062))
+- Fix `loss_weight` in `LabelSmoothLoss`. ([#1058](https://github.com/open-mmlab/mmclassification/pull/1058))
+- Fix the output position of Swin-Transformer. ([#947](https://github.com/open-mmlab/mmclassification/pull/947))
+
+### Docs Update
+
+- Auto generate model summary table. ([#1010](https://github.com/open-mmlab/mmclassification/pull/1010))
+- Refactor new modules tutorial. ([#998](https://github.com/open-mmlab/mmclassification/pull/998))
+
+## v1.0.0rc0(31/8/2022)
+
+MMClassification 1.0.0rc0 is the first version of MMClassification 1.x, a part of the OpenMMLab 2.0 projects.
+
+Built upon the new [training engine](https://github.com/open-mmlab/mmengine), MMClassification 1.x unifies the interfaces of dataset, models, evaluation, and visualization.
+
+And there are some BC-breaking changes. Please check [the migration tutorial](https://mmclassification.readthedocs.io/en/1.x/migration.html) for more details.
+
+## v0.23.1(2/6/2022)
+
+### New Features
+
+- Dedicated MMClsWandbHook for MMClassification (Weights and Biases Integration) ([#764](https://github.com/open-mmlab/mmclassification/pull/764))
+
+### Improvements
+
+- Use mdformat instead of markdownlint to format markdown. ([#844](https://github.com/open-mmlab/mmclassification/pull/844))
+
+### Bug Fixes
+
+- Fix wrong `--local_rank`.
+
+### Docs Update
+
+- Update install tutorials. ([#854](https://github.com/open-mmlab/mmclassification/pull/854))
+- Fix wrong link in README. ([#835](https://github.com/open-mmlab/mmclassification/pull/835))
+
+## v0.23.0(1/5/2022)
+
+### New Features
+
+- Support DenseNet. ([#750](https://github.com/open-mmlab/mmclassification/pull/750))
+- Support VAN. ([#739](https://github.com/open-mmlab/mmclassification/pull/739))
+
+### Improvements
+
+- Support training on IPU and add fine-tuning configs of ViT. ([#723](https://github.com/open-mmlab/mmclassification/pull/723))
+
+### Docs Update
+
+- New style API reference, and easier to use! Welcome [view it](https://mmclassification.readthedocs.io/en/master/api/models.html). ([#774](https://github.com/open-mmlab/mmclassification/pull/774))
+
+## v0.22.1(15/4/2022)
+
+### New Features
+
+- [Feature] Support resize relative position embedding in `SwinTransformer`. ([#749](https://github.com/open-mmlab/mmclassification/pull/749))
+- [Feature] Add PoolFormer backbone and checkpoints. ([#746](https://github.com/open-mmlab/mmclassification/pull/746))
+
+### Improvements
+
+- [Enhance] Improve CPE performance by reduce memory copy. ([#762](https://github.com/open-mmlab/mmclassification/pull/762))
+- [Enhance] Add extra dataloader settings in configs. ([#752](https://github.com/open-mmlab/mmclassification/pull/752))
+
+## v0.22.0(30/3/2022)
+
+### Highlights
+
+- Support a series of CSP Network, such as CSP-ResNet, CSP-ResNeXt and CSP-DarkNet.
+- A new `CustomDataset` class to help you build your own dataset!
+- Support ConvMixer, RepMLP and new dataset - CUB dataset.
+
+### New Features
+
+- [Feature] Add CSPNet backbone and checkpoints. ([#735](https://github.com/open-mmlab/mmclassification/pull/735))
+- [Feature] Add `CustomDataset`. ([#738](https://github.com/open-mmlab/mmclassification/pull/738))
+- [Feature] Add diff seeds to diff ranks. ([#744](https://github.com/open-mmlab/mmclassification/pull/744))
+- [Feature] Support ConvMixer. ([#716](https://github.com/open-mmlab/mmclassification/pull/716))
+- [Feature] Our `dist_train` & `dist_test` tools support distributed training on multiple machines. ([#734](https://github.com/open-mmlab/mmclassification/pull/734))
+- [Feature] Add RepMLP backbone and checkpoints. ([#709](https://github.com/open-mmlab/mmclassification/pull/709))
+- [Feature] Support CUB dataset. ([#703](https://github.com/open-mmlab/mmclassification/pull/703))
+- [Feature] Support ResizeMix. ([#676](https://github.com/open-mmlab/mmclassification/pull/676))
+
+### Improvements
+
+- [Enhance] Use `--a-b` instead of `--a_b` in arguments. ([#754](https://github.com/open-mmlab/mmclassification/pull/754))
+- [Enhance] Add `get_cat_ids` and `get_gt_labels` to KFoldDataset. ([#721](https://github.com/open-mmlab/mmclassification/pull/721))
+- [Enhance] Set torch seed in `worker_init_fn`. ([#733](https://github.com/open-mmlab/mmclassification/pull/733))
+
+### Bug Fixes
+
+- [Fix] Fix the discontiguous output feature map of ConvNeXt. ([#743](https://github.com/open-mmlab/mmclassification/pull/743))
+
+### Docs Update
+
+- [Docs] Add brief installation steps in README for copy&paste. ([#755](https://github.com/open-mmlab/mmclassification/pull/755))
+- [Docs] Fix logo URL link from mmocr to mmpretrain. ([#732](https://github.com/open-mmlab/mmclassification/pull/732))
+
+## v0.21.0(04/03/2022)
+
+### Highlights
+
+- Support ResNetV1c and Wide-ResNet, and provide pre-trained models.
+- Support dynamic input shape for ViT-based algorithms. Now our ViT, DeiT, Swin-Transformer and T2T-ViT support forwarding with any input shape.
+- Reproduce training results of DeiT. And our DeiT-T and DeiT-S have higher accuracy compared with the official weights.
+
+### New Features
+
+- Add ResNetV1c. ([#692](https://github.com/open-mmlab/mmclassification/pull/692))
+- Support Wide-ResNet. ([#715](https://github.com/open-mmlab/mmclassification/pull/715))
+- Support gem pooling ([#677](https://github.com/open-mmlab/mmclassification/pull/677))
+
+### Improvements
+
+- Reproduce training results of DeiT. ([#711](https://github.com/open-mmlab/mmclassification/pull/711))
+- Add ConvNeXt pretrain models on ImageNet-1k. ([#707](https://github.com/open-mmlab/mmclassification/pull/707))
+- Support dynamic input shape for ViT-based algorithms. ([#706](https://github.com/open-mmlab/mmclassification/pull/706))
+- Add `evaluate` function for ConcatDataset. ([#650](https://github.com/open-mmlab/mmclassification/pull/650))
+- Enhance vis-pipeline tool. ([#604](https://github.com/open-mmlab/mmclassification/pull/604))
+- Return code 1 if scripts runs failed. ([#694](https://github.com/open-mmlab/mmclassification/pull/694))
+- Use PyTorch official `one_hot` to implement `convert_to_one_hot`. ([#696](https://github.com/open-mmlab/mmclassification/pull/696))
+- Add a new pre-commit-hook to automatically add a copyright. ([#710](https://github.com/open-mmlab/mmclassification/pull/710))
+- Add deprecation message for deploy tools. ([#697](https://github.com/open-mmlab/mmclassification/pull/697))
+- Upgrade isort pre-commit hooks. ([#687](https://github.com/open-mmlab/mmclassification/pull/687))
+- Use `--gpu-id` instead of `--gpu-ids` in non-distributed multi-gpu training/testing. ([#688](https://github.com/open-mmlab/mmclassification/pull/688))
+- Remove deprecation. ([#633](https://github.com/open-mmlab/mmclassification/pull/633))
+
+### Bug Fixes
+
+- Fix Conformer forward with irregular input size. ([#686](https://github.com/open-mmlab/mmclassification/pull/686))
+- Add `dist.barrier` to fix a bug in directory checking. ([#666](https://github.com/open-mmlab/mmclassification/pull/666))
+
+## v0.20.1(07/02/2022)
+
+### Bug Fixes
+
+- Fix the MMCV dependency version.
+
+## v0.20.0(30/01/2022)
+
+### Highlights
+
+- Support K-fold cross-validation. The tutorial will be released later.
+- Support HRNet, ConvNeXt, Twins and EfficientNet.
+- Provide a tool to convert models from PyTorch to Core-ML.
+
+### New Features
+
+- Support K-fold cross-validation. ([#563](https://github.com/open-mmlab/mmclassification/pull/563))
+- Support HRNet and add pre-trained models. ([#660](https://github.com/open-mmlab/mmclassification/pull/660))
+- Support ConvNeXt and add pre-trained models. ([#670](https://github.com/open-mmlab/mmclassification/pull/670))
+- Support Twins and add pre-trained models. ([#642](https://github.com/open-mmlab/mmclassification/pull/642))
+- Support EfficientNet and add pre-trained models. ([#649](https://github.com/open-mmlab/mmclassification/pull/649))
+- Support `features_only` option in `TIMMBackbone`. ([#668](https://github.com/open-mmlab/mmclassification/pull/668))
+- Add a conversion script from PyTorch to Core-ML models. ([#597](https://github.com/open-mmlab/mmclassification/pull/597))
+
+### Improvements
+
+- New-style CPU training and inference. ([#674](https://github.com/open-mmlab/mmclassification/pull/674))
+- Set up multi-processing in both training and testing. ([#671](https://github.com/open-mmlab/mmclassification/pull/671))
+- Rewrite channel split operation in ShufflenetV2. ([#632](https://github.com/open-mmlab/mmclassification/pull/632))
+- Deprecate the support for "python setup.py test". ([#646](https://github.com/open-mmlab/mmclassification/pull/646))
+- Support single-label, softmax and custom eps options in asymmetric loss. ([#609](https://github.com/open-mmlab/mmclassification/pull/609))
+- Save class names in best checkpoint created by evaluation hook. ([#641](https://github.com/open-mmlab/mmclassification/pull/641))
+
+### Bug Fixes
+
+- Fix potential unexpected behaviors if `metric_options` is not specified in multi-label evaluation. ([#647](https://github.com/open-mmlab/mmclassification/pull/647))
+- Fix API changes in `pytorch-grad-cam>=1.3.7`. ([#656](https://github.com/open-mmlab/mmclassification/pull/656))
+- Fix bug which breaks `cal_train_time` in `analyze_logs.py`. ([#662](https://github.com/open-mmlab/mmclassification/pull/662))
+
+### Docs Update
+
+- Update README in configs according to OpenMMLab standard. ([#672](https://github.com/open-mmlab/mmclassification/pull/672))
+- Update installation guide and README. ([#624](https://github.com/open-mmlab/mmclassification/pull/624))
+
+## v0.19.0(31/12/2021)
+
+### Highlights
+
+- The feature extraction function has been enhanced. See [#593](https://github.com/open-mmlab/mmclassification/pull/593) for more details.
+- Provide the high-acc ResNet-50 training settings from [*ResNet strikes back*](https://arxiv.org/abs/2110.00476).
+- Reproduce the training accuracy of T2T-ViT & RegNetX, and provide self-trained checkpoints.
+- Support DeiT & Conformer backbone and checkpoints.
+- Provide a CAM visualization tool based on [pytorch-grad-cam](https://github.com/jacobgil/pytorch-grad-cam), and detailed [user guide](https://mmclassification.readthedocs.io/en/latest/tools/visualization.html#class-activation-map-visualization)!
+
+### New Features
+
+- Support Precise BN. ([#401](https://github.com/open-mmlab/mmclassification/pull/401))
+- Add CAM visualization tool. ([#577](https://github.com/open-mmlab/mmclassification/pull/577))
+- Repeated Aug and Sampler Registry. ([#588](https://github.com/open-mmlab/mmclassification/pull/588))
+- Add DeiT backbone and checkpoints. ([#576](https://github.com/open-mmlab/mmclassification/pull/576))
+- Support LAMB optimizer. ([#591](https://github.com/open-mmlab/mmclassification/pull/591))
+- Implement the conformer backbone. ([#494](https://github.com/open-mmlab/mmclassification/pull/494))
+- Add the frozen function for Swin Transformer model. ([#574](https://github.com/open-mmlab/mmclassification/pull/574))
+- Support using checkpoint in Swin Transformer to save memory. ([#557](https://github.com/open-mmlab/mmclassification/pull/557))
+
+### Improvements
+
+- [Reproduction] Reproduce RegNetX training accuracy. ([#587](https://github.com/open-mmlab/mmclassification/pull/587))
+- [Reproduction] Reproduce training results of T2T-ViT. ([#610](https://github.com/open-mmlab/mmclassification/pull/610))
+- [Enhance] Provide high-acc training settings of ResNet. ([#572](https://github.com/open-mmlab/mmclassification/pull/572))
+- [Enhance] Set a random seed when the user does not set a seed. ([#554](https://github.com/open-mmlab/mmclassification/pull/554))
+- [Enhance] Added `NumClassCheckHook` and unit tests. ([#559](https://github.com/open-mmlab/mmclassification/pull/559))
+- [Enhance] Enhance feature extraction function. ([#593](https://github.com/open-mmlab/mmclassification/pull/593))
+- [Enhance] Improve efficiency of precision, recall, f1_score and support. ([#595](https://github.com/open-mmlab/mmclassification/pull/595))
+- [Enhance] Improve accuracy calculation performance. ([#592](https://github.com/open-mmlab/mmclassification/pull/592))
+- [Refactor] Refactor `analysis_log.py`. ([#529](https://github.com/open-mmlab/mmclassification/pull/529))
+- [Refactor] Use new API of matplotlib to handle blocking input in visualization. ([#568](https://github.com/open-mmlab/mmclassification/pull/568))
+- [CI] Cancel previous runs that are not completed. ([#583](https://github.com/open-mmlab/mmclassification/pull/583))
+- [CI] Skip build CI if only configs or docs are modified. ([#575](https://github.com/open-mmlab/mmclassification/pull/575))
+
+### Bug Fixes
+
+- Fix test sampler bug. ([#611](https://github.com/open-mmlab/mmclassification/pull/611))
+- Try to create a symbolic link, otherwise copy. ([#580](https://github.com/open-mmlab/mmclassification/pull/580))
+- Fix a bug for multiple outputs in Swin Transformer. ([#571](https://github.com/open-mmlab/mmclassification/pull/571))
+
+### Docs Update
+
+- Update mmcv, torch, cuda version in Dockerfile and docs. ([#594](https://github.com/open-mmlab/mmclassification/pull/594))
+- Add analysis&misc docs. ([#525](https://github.com/open-mmlab/mmclassification/pull/525))
+- Fix docs build dependency. ([#584](https://github.com/open-mmlab/mmclassification/pull/584))
+
+## v0.18.0(30/11/2021)
+
+### Highlights
+
+- Support MLP-Mixer backbone and provide pre-trained checkpoints.
+- Add a tool to visualize the learning rate curve of the training phase. Welcome to use with the [tutorial](https://mmclassification.readthedocs.io/en/latest/tools/visualization.html#learning-rate-schedule-visualization)!
+
+### New Features
+
+- Add MLP Mixer Backbone. ([#528](https://github.com/open-mmlab/mmclassification/pull/528), [#539](https://github.com/open-mmlab/mmclassification/pull/539))
+- Support positive weights in BCE. ([#516](https://github.com/open-mmlab/mmclassification/pull/516))
+- Add a tool to visualize the learning rate in each iteration. ([#498](https://github.com/open-mmlab/mmclassification/pull/498))
+
+### Improvements
+
+- Use CircleCI to do unit tests. ([#567](https://github.com/open-mmlab/mmclassification/pull/567))
+- Support focal loss for single-label tasks. ([#548](https://github.com/open-mmlab/mmclassification/pull/548))
+- Remove useless `import_modules_from_string`. ([#544](https://github.com/open-mmlab/mmclassification/pull/544))
+- Rename config files according to the config name standard. ([#508](https://github.com/open-mmlab/mmclassification/pull/508))
+- Use `reset_classifier` to remove head of timm backbones. ([#534](https://github.com/open-mmlab/mmclassification/pull/534))
+- Support passing arguments to loss from head. ([#523](https://github.com/open-mmlab/mmclassification/pull/523))
+- Refactor `Resize` transform and add `Pad` transform. ([#506](https://github.com/open-mmlab/mmclassification/pull/506))
+- Update mmcv dependency version. ([#509](https://github.com/open-mmlab/mmclassification/pull/509))
+
+### Bug Fixes
+
+- Fix bug when using `ClassBalancedDataset`. ([#555](https://github.com/open-mmlab/mmclassification/pull/555))
+- Fix a bug when using iter-based runner with 'val' workflow. ([#542](https://github.com/open-mmlab/mmclassification/pull/542))
+- Fix interpolation method checking in `Resize`. ([#547](https://github.com/open-mmlab/mmclassification/pull/547))
+- Fix a bug when loading checkpoints in a multi-GPU environment. ([#527](https://github.com/open-mmlab/mmclassification/pull/527))
+- Fix an error on indexing scalar metrics in `analyze_result.py`. ([#518](https://github.com/open-mmlab/mmclassification/pull/518))
+- Fix wrong condition judgment in `analyze_logs.py` and prevent empty curve. ([#510](https://github.com/open-mmlab/mmclassification/pull/510))
+
+### Docs Update
+
+- Fix vit config and model broken links. ([#564](https://github.com/open-mmlab/mmclassification/pull/564))
+- Add abstract and image for every paper. ([#546](https://github.com/open-mmlab/mmclassification/pull/546))
+- Add mmflow and mim in banner and readme. ([#543](https://github.com/open-mmlab/mmclassification/pull/543))
+- Add schedule and runtime tutorial docs. ([#499](https://github.com/open-mmlab/mmclassification/pull/499))
+- Add the top-5 acc in ResNet-CIFAR README. ([#531](https://github.com/open-mmlab/mmclassification/pull/531))
+- Fix TOC of `visualization.md` and add example images. ([#513](https://github.com/open-mmlab/mmclassification/pull/513))
+- Use docs link of other projects and add MMCV docs. ([#511](https://github.com/open-mmlab/mmclassification/pull/511))
+
+## v0.17.0(29/10/2021)
+
+### Highlights
+
+- Support Tokens-to-Token ViT backbone and Res2Net backbone. Welcome to use!
+- Support ImageNet21k dataset.
+- Add a pipeline visualization tool. Try it with the [tutorials](https://mmclassification.readthedocs.io/en/latest/tools/visualization.html#pipeline-visualization)!
+
+### New Features
+
+- Add Tokens-to-Token ViT backbone and converted checkpoints. ([#467](https://github.com/open-mmlab/mmclassification/pull/467))
+- Add Res2Net backbone and converted weights. ([#465](https://github.com/open-mmlab/mmclassification/pull/465))
+- Support ImageNet21k dataset. ([#461](https://github.com/open-mmlab/mmclassification/pull/461))
+- Support seesaw loss. ([#500](https://github.com/open-mmlab/mmclassification/pull/500))
+- Add a pipeline visualization tool. ([#406](https://github.com/open-mmlab/mmclassification/pull/406))
+- Add a tool to find broken files. ([#482](https://github.com/open-mmlab/mmclassification/pull/482))
+- Add a tool to test TorchServe. ([#468](https://github.com/open-mmlab/mmclassification/pull/468))
+
+### Improvements
+
+- Refactor Vision Transformer. ([#395](https://github.com/open-mmlab/mmclassification/pull/395))
+- Use context manager to reuse matplotlib figures. ([#432](https://github.com/open-mmlab/mmclassification/pull/432))
+
+### Bug Fixes
+
+- Remove `DistSamplerSeedHook` if use `IterBasedRunner`. ([#501](https://github.com/open-mmlab/mmclassification/pull/501))
+- Set the priority of `EvalHook` to "LOW" to avoid a bug when using `IterBasedRunner`. ([#488](https://github.com/open-mmlab/mmclassification/pull/488))
+- Fix a wrong parameter of `get_root_logger` in `apis/train.py`. ([#486](https://github.com/open-mmlab/mmclassification/pull/486))
+- Fix version check in dataset builder. ([#474](https://github.com/open-mmlab/mmclassification/pull/474))
+
+### Docs Update
+
+- Add English Colab tutorials and update Chinese Colab tutorials. ([#483](https://github.com/open-mmlab/mmclassification/pull/483), [#497](https://github.com/open-mmlab/mmclassification/pull/497))
+- Add a tutorial for config files. ([#487](https://github.com/open-mmlab/mmclassification/pull/487))
+- Add model-pages in Model Zoo. ([#480](https://github.com/open-mmlab/mmclassification/pull/480))
+- Add code-spell pre-commit hook and fix a large amount of typos. ([#470](https://github.com/open-mmlab/mmclassification/pull/470))
+
+## v0.16.0(30/9/2021)
+
+### Highlights
+
+- We have improved compatibility with downstream repositories like MMDetection and MMSegmentation. We will add some examples of how to use our backbones in MMDetection.
+- Add RepVGG backbone and checkpoints. Welcome to use it!
+- Add timm backbones wrapper, now you can simply use backbones of pytorch-image-models in MMClassification!
+
+### New Features
+
+- Add RepVGG backbone and checkpoints. ([#414](https://github.com/open-mmlab/mmclassification/pull/414))
+- Add timm backbones wrapper. ([#427](https://github.com/open-mmlab/mmclassification/pull/427))
+
+### Improvements
+
+- Fix TnT compatibility and verbose warning. ([#436](https://github.com/open-mmlab/mmclassification/pull/436))
+- Support setting `--out-items` in `tools/test.py`. ([#437](https://github.com/open-mmlab/mmclassification/pull/437))
+- Add datetime info and save models using the torch\<1.6 format. ([#439](https://github.com/open-mmlab/mmclassification/pull/439))
+- Improve downstream repositories compatibility. ([#421](https://github.com/open-mmlab/mmclassification/pull/421))
+- Rename the option `--options` to `--cfg-options` in some tools. ([#425](https://github.com/open-mmlab/mmclassification/pull/425))
+- Add PyTorch 1.9 and Python 3.9 build workflow, and remove some CI. ([#422](https://github.com/open-mmlab/mmclassification/pull/422))
+
+### Bug Fixes
+
+- Fix format error in `test.py` when metric returns `np.ndarray`. ([#441](https://github.com/open-mmlab/mmclassification/pull/441))
+- Fix `publish_model` bug if no parent of `out_file`. ([#463](https://github.com/open-mmlab/mmclassification/pull/463))
+- Fix num_classes bug in pytorch2onnx.py. ([#458](https://github.com/open-mmlab/mmclassification/pull/458))
+- Fix missing runtime requirement `packaging`. ([#459](https://github.com/open-mmlab/mmclassification/pull/459))
+- Fix saving simplified model bug in ONNX export tool. ([#438](https://github.com/open-mmlab/mmclassification/pull/438))
+
+### Docs Update
+
+- Update `getting_started.md` and `install.md`. And rewrite `finetune.md`. ([#466](https://github.com/open-mmlab/mmclassification/pull/466))
+- Use PyTorch style docs theme. ([#457](https://github.com/open-mmlab/mmclassification/pull/457))
+- Update metafile and Readme. ([#435](https://github.com/open-mmlab/mmclassification/pull/435))
+- Add `CITATION.cff`. ([#428](https://github.com/open-mmlab/mmclassification/pull/428))
+
+## v0.15.0(31/8/2021)
+
+### Highlights
+
+- Support `hparams` argument in `AutoAugment` and `RandAugment` to provide hyperparameters for sub-policies.
+- Support custom squeeze channels in `SELayer`.
+- Support classwise weight in losses.
+
+### New Features
+
+- Add `hparams` argument in `AutoAugment` and `RandAugment` and some other improvement. ([#398](https://github.com/open-mmlab/mmclassification/pull/398))
+- Support classwise weight in losses. ([#388](https://github.com/open-mmlab/mmclassification/pull/388))
+- Enhance `SELayer` to support custom squeeze channels. ([#417](https://github.com/open-mmlab/mmclassification/pull/417))
+
+### Code Refactor
+
+- Better result visualization. ([#419](https://github.com/open-mmlab/mmclassification/pull/419))
+- Use `post_process` function to handle pred result processing. ([#390](https://github.com/open-mmlab/mmclassification/pull/390))
+- Update `digit_version` function. ([#402](https://github.com/open-mmlab/mmclassification/pull/402))
+- Avoid albumentations installing both opencv and opencv-headless. ([#397](https://github.com/open-mmlab/mmclassification/pull/397))
+- Avoid unnecessary listdir when building ImageNet. ([#396](https://github.com/open-mmlab/mmclassification/pull/396))
+- Use dynamic mmcv download link in TorchServe dockerfile. ([#387](https://github.com/open-mmlab/mmclassification/pull/387))
+
+### Docs Improvement
+
+- Add readme of some algorithms and update meta yml. ([#418](https://github.com/open-mmlab/mmclassification/pull/418))
+- Add Copyright information. ([#413](https://github.com/open-mmlab/mmclassification/pull/413))
+- Fix typo 'metirc'. ([#411](https://github.com/open-mmlab/mmclassification/pull/411))
+- Update QQ group QR code. ([#393](https://github.com/open-mmlab/mmclassification/pull/393))
+- Add PR template and modify issue template. ([#380](https://github.com/open-mmlab/mmclassification/pull/380))
+
+## v0.14.0(4/8/2021)
+
+### Highlights
+
+- Add transformer-in-transformer backbone and pretrain checkpoints, referring to [the paper](https://arxiv.org/abs/2103.00112).
+- Add Chinese colab tutorial.
+- Provide a Dockerfile to build the mmpretrain dev docker image.
+
+### New Features
+
+- Add transformer in transformer backbone and pretrain checkpoints. ([#339](https://github.com/open-mmlab/mmclassification/pull/339))
+- Support mim, welcome to use mim to manage your mmpretrain project. ([#376](https://github.com/open-mmlab/mmclassification/pull/376))
+- Add Dockerfile. ([#365](https://github.com/open-mmlab/mmclassification/pull/365))
+- Add ResNeSt configs. ([#332](https://github.com/open-mmlab/mmclassification/pull/332))
+
+### Improvements
+
+- Use the `persistent_workers` option if available to accelerate training. ([#349](https://github.com/open-mmlab/mmclassification/pull/349))
+- Add Chinese ipynb tutorial. ([#306](https://github.com/open-mmlab/mmclassification/pull/306))
+- Refactor unit tests. ([#321](https://github.com/open-mmlab/mmclassification/pull/321))
+- Support testing mmdet inference with an mmpretrain backbone. ([#343](https://github.com/open-mmlab/mmclassification/pull/343))
+- Use zero as default value of `thrs` in metrics. ([#341](https://github.com/open-mmlab/mmclassification/pull/341))
+
+### Bug Fixes
+
+- Fix ImageNet dataset annotation file parse bug. ([#370](https://github.com/open-mmlab/mmclassification/pull/370))
+- Fix docstring typo and init bug in ShuffleNetV1. ([#374](https://github.com/open-mmlab/mmclassification/pull/374))
+- Use local ATTENTION registry to avoid conflict with other repositories. ([#376](https://github.com/open-mmlab/mmclassification/pull/375))
+- Fix swin transformer config bug. ([#355](https://github.com/open-mmlab/mmclassification/pull/355))
+- Fix `patch_cfg` argument bug in SwinTransformer. ([#368](https://github.com/open-mmlab/mmclassification/pull/368))
+- Fix duplicate `init_weights` call in ViT init function. ([#373](https://github.com/open-mmlab/mmclassification/pull/373))
+- Fix broken `_base_` link in a resnet config. ([#361](https://github.com/open-mmlab/mmclassification/pull/361))
+- Fix vgg-19 model link missing. ([#363](https://github.com/open-mmlab/mmclassification/pull/363))
+
+## v0.13.0(3/7/2021)
+
+- Support Swin-Transformer backbone and add training configs for Swin-Transformer on ImageNet.
+
+### New Features
+
+- Support Swin-Transformer backbone and add training configs for Swin-Transformer on ImageNet. (#271)
+- Add pretrained model of RegNetX. (#269)
+- Support adding custom hooks in config file. (#305)
+- Improve and add Chinese translation of `CONTRIBUTING.md` and all tools tutorials. (#320)
+- Dump config before training. (#282)
+- Add torchscript and torchserve deployment tools. (#279, #284)
+
+### Improvements
+
+- Improve test tools and add some new tools. (#322)
+- Correct MobilenetV3 backbone structure and add pretrained models. (#291)
+- Refactor `PatchEmbed` and `HybridEmbed` as independent components. (#330)
+- Refactor mixup and cutmix as `Augments` to support more functions. (#278)
+- Refactor weights initialization method. (#270, #318, #319)
+- Refactor `LabelSmoothLoss` to support multiple calculation formulas. (#285)
+
+### Bug Fixes
+
+- Fix bug for CPU training. (#286)
+- Fix missing test data when `num_imgs` can not be evenly divided by `num_gpus`. (#299)
+- Fix build compatibility with pytorch v1.3-1.5. (#301)
+- Fix `magnitude_std` bug in `RandAugment`. (#309)
+- Fix bug when `samples_per_gpu` is 1. (#311)
+
+## v0.12.0(3/6/2021)
+
+- Finish adding Chinese tutorials and build Chinese documentation on readthedocs.
+- Update ResNeXt checkpoints and ResNet checkpoints on CIFAR.
+
+### New Features
+
+- Improve and add Chinese translation of `data_pipeline.md` and `new_modules.md`. (#265)
+- Build Chinese translation on readthedocs. (#267)
+- Add an argument efficientnet_style to `RandomResizedCrop` and `CenterCrop`. (#268)
+
+### Improvements
+
+- Only allow directory operation when rank==0 when testing. (#258)
+- Fix typo in `base_head`. (#274)
+- Update ResNeXt checkpoints. (#283)
+
+### Bug Fixes
+
+- Add attribute `data.test` in MNIST configs. (#264)
+- Download CIFAR/MNIST dataset only on rank 0. (#273)
+- Fix MMCV version compatibility. (#276)
+- Fix CIFAR color channels bug and update checkpoints in model zoo. (#280)
+
+## v0.11.1(21/5/2021)
+
+- Refine `new_dataset.md` and add Chinese translation of `finetune.md`, `new_dataset.md`.
+
+### New Features
+
+- Add `dim` argument for `GlobalAveragePooling`. (#236)
+- Add random noise to `RandAugment` magnitude. (#240)
+- Refine `new_dataset.md` and add Chinese translation of `finetune.md`, `new_dataset.md`. (#243)
+
+### Improvements
+
+- Refactor arguments passing for Heads. (#239)
+- Allow more flexible `magnitude_range` in `RandAugment`. (#249)
+- Inherit the MMCV registry so that, in the future, OpenMMLab repos like MMDet and MMSeg can directly use the backbones supported in MMCls. (#252)
+
+### Bug Fixes
+
+- Fix typo in `analyze_results.py`. (#237)
+- Fix typo in unittests. (#238)
+- Check if specified tmpdir exists when testing to avoid deleting existing data. (#242 & #258)
+- Add missing config files in `MANIFEST.in`. (#250 & #255)
+- Use temporary directory under shared directory to collect results to avoid unavailability of temporary directory for multi-node testing. (#251)
+
+## v0.11.0(1/5/2021)
+
+- Support cutmix trick.
+- Support random augmentation.
+- Add `tools/deployment/test.py` as an ONNX runtime test tool.
+- Support ViT backbone and add training configs for ViT on ImageNet.
+- Add Chinese `README.md` and some Chinese tutorials.
+
+### New Features
+
+- Support cutmix trick. (#198)
+- Add `simplify` option in `pytorch2onnx.py`. (#200)
+- Support random augmentation. (#201)
+- Add config and checkpoint for training ResNet on CIFAR-100. (#208)
+- Add `tools/deployment/test.py` as an ONNX runtime test tool. (#212)
+- Support ViT backbone and add training configs for ViT on ImageNet. (#214)
+- Add finetuning configs for ViT on ImageNet. (#217)
+- Add `device` option to support training on CPU. (#219)
+- Add Chinese `README.md` and some Chinese tutorials. (#221)
+- Add `metafile.yml` in configs to support interaction with Papers With Code (PWC) and MMCLI. (#225)
+- Upload configs and converted checkpoints for ViT fine-tuning on ImageNet. (#230)
+
+### Improvements
+
+- Fix `LabelSmoothLoss` so that label smoothing and mixup could be enabled at the same time. (#203)
+- Add `cal_acc` option in `ClsHead`. (#206)
+- Check `CLASSES` in checkpoint to avoid unexpected key error. (#207)
+- Check mmcv version when importing mmpretrain to ensure compatibility. (#209)
+- Update `CONTRIBUTING.md` to align with that in MMCV. (#210)
+- Change tags to html comments in configs README.md. (#226)
+- Clean codes in ViT backbone. (#227)
+- Reformat `pytorch2onnx.md` tutorial. (#229)
+- Update `setup.py` to support MMCLI. (#232)
+
+### Bug Fixes
+
+- Fix missing `cutmix_prob` in ViT configs. (#220)
+- Fix backend for resize in ResNeXt configs. (#222)
+
+## v0.10.0(1/4/2021)
+
+- Support AutoAugmentation
+- Add tutorials for installation and usage.
+
+### New Features
+
+- Add `Rotate` pipeline for data augmentation. (#167)
+- Add `Invert` pipeline for data augmentation. (#168)
+- Add `Color` pipeline for data augmentation. (#171)
+- Add `Solarize` and `Posterize` pipeline for data augmentation. (#172)
+- Support fp16 training. (#178)
+- Add tutorials for installation and basic usage of MMClassification. (#176)
+- Support `AutoAugmentation`, `AutoContrast`, `Equalize`, `Contrast`, `Brightness` and `Sharpness` pipelines for data augmentation. (#179)
+
+### Improvements
+
+- Support dynamic shape export to onnx. (#175)
+- Release training configs and update model zoo for fp16 (#184)
+- Use MMCV's EvalHook in MMClassification (#182)
+
+### Bug Fixes
+
+- Fix wrong naming in vgg config (#181)
+
+## v0.9.0(1/3/2021)
+
+- Implement mixup trick.
+- Add a new tool to create TensorRT engine from ONNX, run inference and verify outputs in Python.
+
+### New Features
+
+- Implement mixup and provide configs of training ResNet50 using mixup. (#160)
+- Add `Shear` pipeline for data augmentation. (#163)
+- Add `Translate` pipeline for data augmentation. (#165)
+- Add `tools/onnx2tensorrt.py` as a tool to create TensorRT engine from ONNX, run inference and verify outputs in Python. (#153)
+
+### Improvements
+
+- Add `--eval-options` in `tools/test.py` to support eval options override, matching the behavior of other open-mmlab projects. (#158)
+- Support showing and saving painted results in `mmpretrain.apis.test` and `tools/test.py`, matching the behavior of other open-mmlab projects. (#162)
+
+### Bug Fixes
+
+- Fix configs for VGG, replace checkpoints converted from other repos with the ones trained by ourselves and upload the missing logs in the model zoo. (#161)
+
+## v0.8.0(31/1/2021)
+
+- Support multi-label task.
+- Support more flexible metrics settings.
+- Fix bugs.
+
+### New Features
+
+- Add evaluation metrics: mAP, CP, CR, CF1, OP, OR, OF1 for multi-label task. (#123)
+- Add BCE loss for multi-label task. (#130)
+- Add focal loss for multi-label task. (#131)
+- Support PASCAL VOC 2007 dataset for multi-label task. (#134)
+- Add asymmetric loss for multi-label task. (#132)
+- Add analyze_results.py to select images for success/fail demonstration. (#142)
+- Support new metric that calculates the total number of occurrences of each label. (#143)
+- Support class-wise evaluation results. (#143)
+- Add thresholds in eval_metrics. (#146)
+- Add heads and a baseline config for multilabel task. (#145)
+
+### Improvements
+
+- Remove the models with 0 checkpoint and ignore the repeated papers when counting papers to gain more accurate model statistics. (#135)
+- Add tags in README.md. (#137)
+- Fix optional issues in docstring. (#138)
+- Update stat.py to classify papers. (#139)
+- Fix mismatched columns in README.md. (#150)
+- Fix test.py to support more evaluation metrics. (#155)
+
+### Bug Fixes
+
+- Fix bug in VGG weight_init. (#140)
+- Fix bug in 2 ResNet configs in which outdated heads were used. (#147)
+- Fix bug of misordered height and width in `RandomCrop` and `RandomResizedCrop`. (#151)
+- Fix missing `meta_keys` in `Collect`. (#149 & #152)
+
+## v0.7.0(31/12/2020)
+
+- Add more evaluation metrics.
+- Fix bugs.
+
+### New Features
+
+- Remove installation of MMCV from requirements. (#90)
+- Add 3 evaluation metrics: precision, recall and F-1 score. (#93)
+- Allow config override during testing and inference with `--options`. (#91 & #96)
+
+### Improvements
+
+- Use `build_runner` to make runners more flexible. (#54)
+- Support to get category ids in `BaseDataset`. (#72)
+- Allow `CLASSES` override during `BaseDataset` initialization. (#85)
+- Allow input image as ndarray during inference. (#87)
+- Optimize MNIST config. (#98)
+- Add config links in model zoo documentation. (#99)
+- Use functions from MMCV to collect environment. (#103)
+- Refactor config files so that they are now categorized by methods. (#116)
+- Add README in config directory. (#117)
+- Add model statistics. (#119)
+- Refactor documentation in consistency with other MM repositories. (#126)
+
+### Bug Fixes
+
+- Add missing `CLASSES` argument to dataset wrappers. (#66)
+- Fix slurm evaluation error during training. (#69)
+- Resolve error caused by shape in `Accuracy`. (#104)
+- Fix bug caused by extremely insufficient data in distributed sampler. (#108)
+- Fix bug in `gpu_ids` in distributed training. (#107)
+- Fix bug caused by extremely insufficient data in collect results during testing. (#114)
+
+## v0.6.0(11/10/2020)
+
+- Support new method: ResNeSt and VGG.
+- Support new dataset: CIFAR10.
+- Provide new tools for model inference and model conversion from PyTorch to ONNX.
+
+### New Features
+
+- Add model inference. (#16)
+- Add pytorch2onnx. (#20)
+- Add PIL backend for transform `Resize`. (#21)
+- Add ResNeSt. (#25)
+- Add VGG and its pretrained models. (#27)
+- Add CIFAR10 configs and models. (#38)
+- Add albumentations transforms. (#45)
+- Visualize results on image demo. (#58)
+
+### Improvements
+
+- Replace urlretrieve with urlopen in dataset.utils. (#13)
+- Resize image according to its short edge. (#22)
+- Update ShuffleNet config. (#31)
+- Update pre-trained models for shufflenet_v2, shufflenet_v1, se-resnet50, se-resnet101. (#33)
+
+### Bug Fixes
+
+- Fix init_weights in `shufflenet_v2.py`. (#29)
+- Fix the parameter `size` in test_pipeline. (#30)
+- Fix the parameter in cosine lr schedule. (#32)
+- Fix the convert tools for mobilenet_v2. (#34)
+- Fix crash in CenterCrop transform when the image is greyscale. (#40)
+- Fix outdated configs. (#53)
diff --git a/docs/en/notes/contribution_guide.md b/docs/en/notes/contribution_guide.md
new file mode 120000
index 0000000000000000000000000000000000000000..c97564d93a7f0a753a23cd97d2467d595bd154ff
--- /dev/null
+++ b/docs/en/notes/contribution_guide.md
@@ -0,0 +1 @@
+../../../CONTRIBUTING.md
\ No newline at end of file
diff --git a/docs/en/notes/faq.md b/docs/en/notes/faq.md
new file mode 100644
index 0000000000000000000000000000000000000000..da45841bb10c347bb3724d5e49e90ab5199c5caf
--- /dev/null
+++ b/docs/en/notes/faq.md
@@ -0,0 +1,116 @@
+# Frequently Asked Questions
+
+We list some common troubles faced by many users and their corresponding
+solutions here. Feel free to enrich the list if you find any frequent issues
+and have ways to help others to solve them. If the contents here do not cover
+your issue, please create an issue using the
+[provided templates](https://github.com/open-mmlab/mmpretrain/issues/new/choose)
+and make sure you fill in all required information in the template.
+
+## Installation
+
+- Compatibility issue between MMEngine, MMCV and MMPretrain
+
+ The compatible versions of MMPretrain, MMEngine and MMCV are shown below. Please
+ choose the correct versions of MMEngine and MMCV to avoid installation issues; a quick setup example follows this list.
+
+ | MMPretrain version | MMEngine version | MMCV version |
+ | :----------------: | :---------------: | :--------------: |
+ | 1.2.0 (main) | mmengine >= 0.8.3 | mmcv >= 2.0.0 |
+ | 1.1.1 | mmengine >= 0.8.3 | mmcv >= 2.0.0 |
+ | 1.0.0 | mmengine >= 0.8.0 | mmcv >= 2.0.0 |
+ | 1.0.0rc8 | mmengine >= 0.7.1 | mmcv >= 2.0.0rc4 |
+ | 1.0.0rc7 | mmengine >= 0.5.0 | mmcv >= 2.0.0rc4 |
+
+ ```{note}
+ Since the `dev` branch is under frequent development, the MMEngine and MMCV
+ version dependency may be inaccurate. If you encounter problems when using
+ the `dev` branch, please try to update MMEngine and MMCV to the latest version.
+ ```
+
+- Using Albumentations
+
+ If you would like to use `albumentations`, we suggest using `pip install -r requirements/albu.txt` or
+ `pip install -U albumentations --no-binary qudida,albumentations`.
+
+ If you simply use `pip install albumentations>=0.3.2`, it will install `opencv-python-headless` simultaneously
+ (even though you have already installed `opencv-python`). Please refer to the
+ [official documentation](https://albumentations.ai/docs/getting_started/installation/#note-on-opencv-dependencies)
+ for details.
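+
+As a quick sketch (the version pins follow the table above and are only an example), you can install matching versions of MMEngine and MMCV with [MIM](https://github.com/open-mmlab/mim):
+
+```shell
+pip install -U openmim
+mim install "mmengine>=0.8.3" "mmcv>=2.0.0"
+```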
+
+## General Questions
+
+### Do I need to reinstall mmpretrain after some code modifications?
+
+If you follow [the best practice](../get_started.md#best-practices) and install mmpretrain from source,
+any local modifications made to the code will take effect without
+reinstallation.
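+
+For reference, a minimal sketch of an editable install from source (following the best practice linked above), which is why local edits take effect immediately:
+
+```shell
+# clone the repository and install it in editable (development) mode
+git clone https://github.com/open-mmlab/mmpretrain.git
+cd mmpretrain
+pip install -U openmim && mim install -e .
+```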
+
+### How to develop with multiple MMPretrain versions?
+
+Generally speaking, we recommend using different virtual environments to
+manage MMPretrain in different working directories. However, you
+can also use the same environment to develop MMPretrain in different
+folders, like mmpretrain-0.21 and mmpretrain-0.23. When you run the train or test shell scripts,
+they adopt the mmpretrain package in the current folder. And when you run other Python
+scripts, you can also add `` PYTHONPATH=`pwd` `` at the beginning of your command
+to use the package in the current folder.
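+
+For example, a quick sketch to check which copy of mmpretrain is picked up (the folder name is only illustrative):
+
+```shell
+cd ~/mmpretrain-0.23  # the working copy you want to develop
+PYTHONPATH=`pwd` python -c "import mmpretrain; print(mmpretrain.__file__)"
+```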
+
+Conversely, to use the default MMPretrain installed in the environment
+rather than the one you are working with, you can remove the following line
+in those shell scripts:
+
+```shell
+PYTHONPATH="$(dirname $0)/..":$PYTHONPATH
+```
+
+### What's the relationship between the `load_from` and the `init_cfg`?
+
+- `load_from`: If `resume=False`, only the model weights are loaded, which is mainly used to load trained models;
+ if `resume=True`, all of the model weights, optimizer state, and other training information are loaded, which is
+ mainly used to resume interrupted training.
+
+- `init_cfg`: You can also specify `init_cfg=dict(type="Pretrained", checkpoint=xxx)` to load a checkpoint. It
+ means the weights are loaded during model weight initialization, that is, only at the
+ beginning of the training. It's mainly used to fine-tune a pre-trained model, and you can set it in
+ the backbone config and use the `prefix` field to only load backbone weights, for example:
+
+```python
+model = dict(
+ backbone=dict(
+ type='ResNet',
+ depth=50,
+ init_cfg=dict(type='Pretrained', checkpoint=xxx, prefix='backbone'),
+ )
+ ...
+)
+```
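+
+For comparison, a minimal sketch of using `load_from` to resume an interrupted training run (the checkpoint path is just a placeholder):
+
+```python
+# resume training from a previously saved checkpoint
+load_from = 'work_dirs/my_exp/epoch_20.pth'  # placeholder checkpoint path
+resume = True  # also restore the optimizer state and other training information
+```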
+
+See the [Fine-tune Models](./finetune_custom_dataset.md) for more details about fine-tuning.
+
+### What's the difference between `default_hooks` and `custom_hooks`?
+
+Almost none. Usually, the `default_hooks` field is used to specify the hooks that are used in almost
+all experiments, while the `custom_hooks` field is used only in some experiments.
+
+Another difference is that `default_hooks` is a dict while `custom_hooks` is a list; please don't
+confuse them.
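+
+For instance, a minimal sketch (the hook types and values here are only illustrative) showing the different shapes of the two fields:
+
+```python
+# `default_hooks` is a dict: each key overrides one of the pre-defined hooks.
+default_hooks = dict(
+    logger=dict(type='LoggerHook', interval=100),
+    checkpoint=dict(type='CheckpointHook', interval=1),
+)
+
+# `custom_hooks` is a list of extra hooks appended to the runner.
+custom_hooks = [
+    dict(type='EMAHook', momentum=1e-4),
+]
+```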
+
+### During training, I got no training log, what's the reason?
+
+If your training dataset is small while the batch size is large, our default log interval may be too large to
+record your training log.
+
+You can shrink the log interval and try again, like:
+
+```python
+default_hooks = dict(
+ ...
+ logger=dict(type='LoggerHook', interval=10),
+ ...
+)
+```
+
+### How to train with other datasets, like my own dataset or COCO?
+
+We provide [specific examples](./pretrain_custom_dataset.md) to show how to train with other datasets.
diff --git a/docs/en/notes/finetune_custom_dataset.md b/docs/en/notes/finetune_custom_dataset.md
new file mode 100644
index 0000000000000000000000000000000000000000..4000268ca47651233799dbcba3add351979e65c0
--- /dev/null
+++ b/docs/en/notes/finetune_custom_dataset.md
@@ -0,0 +1,340 @@
+# How to Fine-tune with Custom Dataset
+
+In most scenarios, we want to apply a pre-trained model instead of training from scratch, since training from scratch may introduce extra uncertainty about model convergence and is time-consuming.
+The common practice is to start from a model trained on a large dataset, which hopefully provides better knowledge than a random initialization. Roughly speaking, this process is known as fine-tuning.
+
+Models pre-trained on the ImageNet dataset have been demonstrated to be effective for other datasets and other downstream tasks.
+Hence, this tutorial provides instructions for users to use the models provided in the [Model Zoo](../modelzoo_statistics.md) for other datasets to obtain better performance.
+
+In this tutorial, we provide a practice example and some tips on how to fine-tune a model on your own dataset.
+
+## Step-1: Prepare your dataset
+
+Prepare your dataset following [Prepare Dataset](../user_guides/dataset_prepare.md).
+And the root folder of the dataset can be like `data/custom_dataset/`.
+
+Here, we assume you want to do supervised image-classification training, and use the sub-folder format
+`CustomDataset` to organize your dataset as:
+
+```text
+data/custom_dataset/
+├── train
+│ ├── class_x
+│ │ ├── x_1.png
+│ │ ├── x_2.png
+│ │ ├── x_3.png
+│ │ └── ...
+│ ├── class_y
+│ └── ...
+└── test
+ ├── class_x
+ │ ├── test_x_1.png
+ │ ├── test_x_2.png
+ │ ├── test_x_3.png
+ │ └── ...
+ ├── class_y
+ └── ...
+```
+
+## Step-2: Choose one config as template
+
+Here, we would like to use `configs/resnet/resnet50_8xb32_in1k.py` as the example. We first copy this config
+file to the same folder and rename it as `resnet50_8xb32-ft_custom.py`.
+
+```{tip}
+As a convention, the last field of the config name is the dataset, e.g., `in1k` for the ImageNet dataset and `coco` for the COCO dataset.
+```
+
+The content of this config is:
+
+```python
+_base_ = [
+ '../_base_/models/resnet50.py', # model settings
+ '../_base_/datasets/imagenet_bs32.py', # data settings
+ '../_base_/schedules/imagenet_bs256.py', # schedule settings
+ '../_base_/default_runtime.py', # runtime settings
+]
+```
+
+## Step-3: Edit the model settings
+
+When fine-tuning a model, usually we want to load the pre-trained backbone
+weights and train a new classification head from scratch.
+
+To load the pre-trained backbone, we need to change the initialization config
+of the backbone and use the `Pretrained` initialization function. Besides, in the
+`init_cfg`, we use `prefix='backbone'` to tell the initialization function
+the prefix of the submodule to be loaded from the checkpoint.
+
+For example, `backbone` here means to load the backbone submodule. Here we
+use an online checkpoint, which will be downloaded automatically during training;
+you can also download the model manually and use a local path.
+Then we need to modify the head according to the number of classes of the new
+dataset by simply changing `num_classes` in the head.
+
+When the new dataset is small and shares the same domain as the pre-training dataset,
+we may want to freeze the parameters of the first several stages of the
+backbone, which helps the network keep its ability to extract the low-level
+features learnt from the pre-trained model. In MMPretrain, you can simply
+specify how many stages to freeze with the `frozen_stages` argument. For example, to
+freeze the parameters of the first two stages, just use the following config:
+
+```{note}
+Not all backbones support the `frozen_stages` argument by now. Please check
+[the docs](https://mmpretrain.readthedocs.io/en/latest/api.html#module-mmpretrain.models.backbones)
+to confirm if your backbone supports it.
+```
+
+```python
+_base_ = [
+ '../_base_/models/resnet50.py', # model settings
+ '../_base_/datasets/imagenet_bs32.py', # data settings
+ '../_base_/schedules/imagenet_bs256.py', # schedule settings
+ '../_base_/default_runtime.py', # runtime settings
+]
+
+# >>>>>>>>>>>>>>> Override model settings here >>>>>>>>>>>>>>>>>>>
+model = dict(
+ backbone=dict(
+ frozen_stages=2,
+ init_cfg=dict(
+ type='Pretrained',
+ checkpoint='https://download.openmmlab.com/mmclassification/v0/resnet/resnet50_8xb32_in1k_20210831-ea4938fc.pth',
+ prefix='backbone',
+ )),
+ head=dict(num_classes=10),
+)
+# <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
+```
+
+```{tip}
+Here we only need to set the part of the configs we want to modify, because the
+inherited configs will be merged to get the entire config.
+```
+
+## Step-4: Edit the dataset settings
+
+To fine-tune on a new dataset, we need to override some dataset settings, like the type of dataset, data
+pipeline, etc.
+
+```python
+_base_ = [
+ '../_base_/models/resnet50.py', # model settings
+ '../_base_/datasets/imagenet_bs32.py', # data settings
+ '../_base_/schedules/imagenet_bs256.py', # schedule settings
+ '../_base_/default_runtime.py', # runtime settings
+]
+
+# model settings
+model = dict(
+ backbone=dict(
+ frozen_stages=2,
+ init_cfg=dict(
+ type='Pretrained',
+ checkpoint='https://download.openmmlab.com/mmclassification/v0/resnet/resnet50_8xb32_in1k_20210831-ea4938fc.pth',
+ prefix='backbone',
+ )),
+ head=dict(num_classes=10),
+)
+
+# >>>>>>>>>>>>>>> Override data settings here >>>>>>>>>>>>>>>>>>>
+data_root = 'data/custom_dataset'
+train_dataloader = dict(
+ dataset=dict(
+ type='CustomDataset',
+ data_root=data_root,
+ ann_file='', # We assume you are using the sub-folder format without ann_file
+ data_prefix='train',
+ ))
+val_dataloader = dict(
+ dataset=dict(
+ type='CustomDataset',
+ data_root=data_root,
+ ann_file='', # We assume you are using the sub-folder format without ann_file
+ data_prefix='test',
+ ))
+test_dataloader = val_dataloader
+# <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
+```
+
+## Step-5: Edit the schedule settings (optional)
+
+The fine-tuning hyper-parameters differ from the default schedule. Fine-tuning usually
+requires a smaller learning rate and a learning rate schedule that decays over fewer epochs.
+
+```python
+_base_ = [
+ '../_base_/models/resnet50.py', # model settings
+ '../_base_/datasets/imagenet_bs32.py', # data settings
+ '../_base_/schedules/imagenet_bs256.py', # schedule settings
+ '../_base_/default_runtime.py', # runtime settings
+]
+
+# model settings
+model = dict(
+ backbone=dict(
+ frozen_stages=2,
+ init_cfg=dict(
+ type='Pretrained',
+ checkpoint='https://download.openmmlab.com/mmclassification/v0/resnet/resnet50_8xb32_in1k_20210831-ea4938fc.pth',
+ prefix='backbone',
+ )),
+ head=dict(num_classes=10),
+)
+
+# data settings
+data_root = 'data/custom_dataset'
+train_dataloader = dict(
+ dataset=dict(
+ type='CustomDataset',
+ data_root=data_root,
+ ann_file='', # We assume you are using the sub-folder format without ann_file
+ data_prefix='train',
+ ))
+val_dataloader = dict(
+ dataset=dict(
+ type='CustomDataset',
+ data_root=data_root,
+ ann_file='', # We assume you are using the sub-folder format without ann_file
+ data_prefix='test',
+ ))
+test_dataloader = val_dataloader
+
+# >>>>>>>>>>>>>>> Override schedule settings here >>>>>>>>>>>>>>>>>>>
+# optimizer hyper-parameters
+optim_wrapper = dict(
+ optimizer=dict(type='SGD', lr=0.01, momentum=0.9, weight_decay=0.0001))
+# learning policy
+param_scheduler = dict(
+ type='MultiStepLR', by_epoch=True, milestones=[15], gamma=0.1)
+# <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
+```
+
+```{tip}
+Refer to [Learn about Configs](../user_guides/config.md) for more detailed configurations.
+```
+
+## Start Training
+
+Now, we have finished the fine-tuning config file as follows:
+
+```python
+_base_ = [
+ '../_base_/models/resnet50.py', # model settings
+ '../_base_/datasets/imagenet_bs32.py', # data settings
+ '../_base_/schedules/imagenet_bs256.py', # schedule settings
+ '../_base_/default_runtime.py', # runtime settings
+]
+
+# model settings
+model = dict(
+ backbone=dict(
+ frozen_stages=2,
+ init_cfg=dict(
+ type='Pretrained',
+ checkpoint='https://download.openmmlab.com/mmclassification/v0/resnet/resnet50_8xb32_in1k_20210831-ea4938fc.pth',
+ prefix='backbone',
+ )),
+ head=dict(num_classes=10),
+)
+
+# data settings
+data_root = 'data/custom_dataset'
+train_dataloader = dict(
+ dataset=dict(
+ type='CustomDataset',
+ data_root=data_root,
+ ann_file='', # We assume you are using the sub-folder format without ann_file
+ data_prefix='train',
+ ))
+val_dataloader = dict(
+ dataset=dict(
+ type='CustomDataset',
+ data_root=data_root,
+ ann_file='', # We assume you are using the sub-folder format without ann_file
+ data_prefix='test',
+ ))
+test_dataloader = val_dataloader
+
+# schedule settings
+optim_wrapper = dict(
+ optimizer=dict(type='SGD', lr=0.01, momentum=0.9, weight_decay=0.0001))
+param_scheduler = dict(
+ type='MultiStepLR', by_epoch=True, milestones=[15], gamma=0.1)
+```
+
+To train the model with 8 GPUs, use the following command:
+
+```shell
+bash tools/dist_train.sh configs/resnet/resnet50_8xb32-ft_custom.py 8
+```
+
+You can also train the model with only one GPU using the following command:
+
+```shell
+python tools/train.py configs/resnet/resnet50_8xb32-ft_custom.py
+```
+
+However, an important config needs to be changed when using a single GPU. We need to
+change the dataset config as follows:
+
+```python
+data_root = 'data/custom_dataset'
+train_dataloader = dict(
+ batch_size=256,
+ dataset=dict(
+ type='CustomDataset',
+ data_root=data_root,
+ ann_file='', # We assume you are using the sub-folder format without ann_file
+ data_prefix='train',
+ ))
+val_dataloader = dict(
+ dataset=dict(
+ type='CustomDataset',
+ data_root=data_root,
+ ann_file='', # We assume you are using the sub-folder format without ann_file
+ data_prefix='test',
+ ))
+test_dataloader = val_dataloader
+```
+
+This is because our training schedule assumes a total batch size of 256. When using 8 GPUs,
+the `batch_size=32` setting in the base config file applies to every GPU, so the total batch
+size is 256. When using one GPU, you need to change the batch size to 256 manually to
+match the training schedule.
+
+However, a larger batch size requires more GPU memory. Here are several simple tricks to save GPU
+memory:
+
+1. Enable Automatic-Mixed-Precision training.
+
+ ```shell
+ python tools/train.py configs/resnet/resnet50_8xb32-ft_custom.py --amp
+ ```
+
+2. Use a smaller batch size, like `batch_size=32` instead of 256, and enable the auto learning rate scaling.
+
+ ```shell
+ python tools/train.py configs/resnet/resnet50_8xb32-ft_custom.py --auto-scale-lr
+ ```
+
+ The auto learning rate scaling will adjust the learning rate according to the actual batch size and the
+ `auto_scale_lr.base_batch_size` (you can find it in the base config
+ `configs/_base_/schedules/imagenet_bs256.py`).
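+
+For reference, the relevant line in that base schedule config is expected to look like the minimal sketch below (the value is assumed from the config name; check your local copy of the file):
+
+```python
+# excerpt (assumed) from configs/_base_/schedules/imagenet_bs256.py
+auto_scale_lr = dict(base_batch_size=256)
+```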
+
+```{note}
+Most of these tricks may influence the training performance slightly.
+```
+
+### Apply pre-trained model with command line
+
+If you don't want to modify the configs, you could use `--cfg-options` to add your pre-trained model path to `init_cfg`.
+
+For example, the command below will also load the pre-trained model.
+
+```shell
+bash tools/dist_train.sh configs/resnet/resnet50_8xb32-ft_custom.py 8 \
+ --cfg-options model.backbone.init_cfg.type='Pretrained' \
+ model.backbone.init_cfg.checkpoint='https://download.openmmlab.com/mmselfsup/1.x/mocov3/mocov3_resnet50_8xb512-amp-coslr-100e_in1k/mocov3_resnet50_8xb512-amp-coslr-100e_in1k_20220927-f1144efa.pth' \
+ model.backbone.init_cfg.prefix='backbone'
+```
diff --git a/docs/en/notes/pretrain_custom_dataset.md b/docs/en/notes/pretrain_custom_dataset.md
new file mode 100644
index 0000000000000000000000000000000000000000..c9e583799c2922579b3611892e4dae56ca2a285d
--- /dev/null
+++ b/docs/en/notes/pretrain_custom_dataset.md
@@ -0,0 +1,255 @@
+# How to Pretrain with Custom Dataset
+
+In this tutorial, we provide a practice example and some tips on how to train on your own dataset.
+
+In MMPretrain, we support the `CustomDataset` (similar to `ImageFolder` in `torchvision`), which is able to read the images within the specified folder directly. You only need to prepare the path information of the custom dataset and edit the config.
+
+## Step-1: Prepare your dataset
+
+Prepare your dataset following [Prepare Dataset](../user_guides/dataset_prepare.md).
+And the root folder of the dataset can be like `data/custom_dataset/`.
+
+Here, we assume you want to do unsupervised training, and use the sub-folder format `CustomDataset` to
+organize your dataset as:
+
+```text
+data/custom_dataset/
+├── sample1.png
+├── sample2.png
+├── sample3.png
+├── sample4.png
+└── ...
+```
+
+## Step-2: Choose one config as template
+
+Here, we would like to use `configs/mae/mae_vit-base-p16_8xb512-amp-coslr-300e_in1k.py` as the example. We
+first copy this config file to the same folder and rename it as
+`mae_vit-base-p16_8xb512-amp-coslr-300e_custom.py`.
+
+```{tip}
+As a convention, the last field of the config name is the dataset, e.g., `in1k` for the ImageNet dataset and `coco` for the COCO dataset.
+```
+
+The content of this config is:
+
+```python
+_base_ = [
+ '../_base_/models/mae_vit-base-p16.py',
+ '../_base_/datasets/imagenet_bs512_mae.py',
+ '../_base_/default_runtime.py',
+]
+
+# optimizer wrapper
+optim_wrapper = dict(
+ type='AmpOptimWrapper',
+ loss_scale='dynamic',
+ optimizer=dict(
+ type='AdamW',
+ lr=1.5e-4 * 4096 / 256,
+ betas=(0.9, 0.95),
+ weight_decay=0.05),
+ paramwise_cfg=dict(
+ custom_keys={
+ 'ln': dict(decay_mult=0.0),
+ 'bias': dict(decay_mult=0.0),
+ 'pos_embed': dict(decay_mult=0.),
+ 'mask_token': dict(decay_mult=0.),
+ 'cls_token': dict(decay_mult=0.)
+ }))
+
+# learning rate scheduler
+param_scheduler = [
+ dict(
+ type='LinearLR',
+ start_factor=0.0001,
+ by_epoch=True,
+ begin=0,
+ end=40,
+ convert_to_iter_based=True),
+ dict(
+ type='CosineAnnealingLR',
+ T_max=260,
+ by_epoch=True,
+ begin=40,
+ end=300,
+ convert_to_iter_based=True)
+]
+
+# runtime settings
+train_cfg = dict(type='EpochBasedTrainLoop', max_epochs=300)
+default_hooks = dict(
+ # only keeps the latest 3 checkpoints
+ checkpoint=dict(type='CheckpointHook', interval=1, max_keep_ckpts=3))
+
+randomness = dict(seed=0, diff_rank_seed=True)
+
+# auto resume
+resume = True
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR
+# based on the actual training batch size.
+auto_scale_lr = dict(base_batch_size=4096)
+```
+
+## Step-3: Edit the dataset related config
+
+- Override the `type` of dataset settings as `'CustomDataset'`
+- Override the `data_root` of dataset settings as `data/custom_dataset`.
+- Override the `ann_file` of dataset settings as an empty string since we assume you are using the sub-folder
+ format `CustomDataset`.
+- Override the `data_prefix` of dataset settings as an empty string since we are using the whole dataset under
+ the `data_root`, and you don't need to split samples into different subsets or set the `data_prefix`.
+
+The modified config will be like:
+
+```python
+_base_ = [
+ '../_base_/models/mae_vit-base-p16.py',
+ '../_base_/datasets/imagenet_bs512_mae.py',
+ '../_base_/default_runtime.py',
+]
+
+# >>>>>>>>>>>>>>> Override dataset settings here >>>>>>>>>>>>>>>>>>>
+train_dataloader = dict(
+ dataset=dict(
+ type='CustomDataset',
+ data_root='data/custom_dataset/',
+ ann_file='', # We assume you are using the sub-folder format without ann_file
+ data_prefix='', # The `data_root` is the data_prefix directly.
+ with_label=False,
+ )
+)
+# <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
+
+# optimizer wrapper
+optim_wrapper = dict(
+ type='AmpOptimWrapper',
+ loss_scale='dynamic',
+ optimizer=dict(
+ type='AdamW',
+ lr=1.5e-4 * 4096 / 256,
+ betas=(0.9, 0.95),
+ weight_decay=0.05),
+ paramwise_cfg=dict(
+ custom_keys={
+ 'ln': dict(decay_mult=0.0),
+ 'bias': dict(decay_mult=0.0),
+ 'pos_embed': dict(decay_mult=0.),
+ 'mask_token': dict(decay_mult=0.),
+ 'cls_token': dict(decay_mult=0.)
+ }))
+
+# learning rate scheduler
+param_scheduler = [
+ dict(
+ type='LinearLR',
+ start_factor=0.0001,
+ by_epoch=True,
+ begin=0,
+ end=40,
+ convert_to_iter_based=True),
+ dict(
+ type='CosineAnnealingLR',
+ T_max=260,
+ by_epoch=True,
+ begin=40,
+ end=300,
+ convert_to_iter_based=True)
+]
+
+# runtime settings
+train_cfg = dict(type='EpochBasedTrainLoop', max_epochs=300)
+default_hooks = dict(
+ # only keeps the latest 3 checkpoints
+ checkpoint=dict(type='CheckpointHook', interval=1, max_keep_ckpts=3))
+
+randomness = dict(seed=0, diff_rank_seed=True)
+
+# auto resume
+resume = True
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR
+# based on the actual training batch size.
+auto_scale_lr = dict(base_batch_size=4096)
+```
+
+By using the edited config file, you are able to train a self-supervised model with the MAE algorithm on the custom dataset.
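+
+For example, a typical launch command looks like the sketch below (the GPU count is illustrative; adjust it to your machine, and note that `auto_scale_lr` above assumes a total batch size of 4096):
+
+```shell
+bash tools/dist_train.sh configs/mae/mae_vit-base-p16_8xb512-amp-coslr-300e_custom.py 8
+```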
+
+## Another example: Train MAE on COCO Dataset
+
+```{note}
+You need to install MMDetection to use the `mmdet.CocoDataset`. Please follow this [documentation](https://github.com/open-mmlab/mmdetection/blob/3.x/docs/en/get_started.md) to install it.
+```
+
+Following the aforementioned idea, we also present an example of how to train MAE on the COCO dataset. The edited file will be like this:
+
+```python
+_base_ = [
+ '../_base_/models/mae_vit-base-p16.py',
+ '../_base_/datasets/imagenet_mae.py',
+ '../_base_/default_runtime.py',
+]
+
+# >>>>>>>>>>>>>>> Override dataset settings here >>>>>>>>>>>>>>>>>>>
+train_dataloader = dict(
+ dataset=dict(
+ type='mmdet.CocoDataset',
+ data_root='data/coco/',
+ ann_file='annotations/instances_train2017.json', # Only for loading images, and the labels won't be used.
+ data_prefix=dict(img='train2017/'),
+ )
+)
+# <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
+
+# optimizer wrapper
+optim_wrapper = dict(
+ type='AmpOptimWrapper',
+ loss_scale='dynamic',
+ optimizer=dict(
+ type='AdamW',
+ lr=1.5e-4 * 4096 / 256,
+ betas=(0.9, 0.95),
+ weight_decay=0.05),
+ paramwise_cfg=dict(
+ custom_keys={
+ 'ln': dict(decay_mult=0.0),
+ 'bias': dict(decay_mult=0.0),
+ 'pos_embed': dict(decay_mult=0.),
+ 'mask_token': dict(decay_mult=0.),
+ 'cls_token': dict(decay_mult=0.)
+ }))
+
+# learning rate scheduler
+param_scheduler = [
+ dict(
+ type='LinearLR',
+ start_factor=0.0001,
+ by_epoch=True,
+ begin=0,
+ end=40,
+ convert_to_iter_based=True),
+ dict(
+ type='CosineAnnealingLR',
+ T_max=260,
+ by_epoch=True,
+ begin=40,
+ end=300,
+ convert_to_iter_based=True)
+]
+
+# runtime settings
+train_cfg = dict(type='EpochBasedTrainLoop', max_epochs=300)
+default_hooks = dict(
+ # only keeps the latest 3 checkpoints
+ checkpoint=dict(type='CheckpointHook', interval=1, max_keep_ckpts=3))
+
+randomness = dict(seed=0, diff_rank_seed=True)
+
+# auto resume
+resume = True
+
+# NOTE: `auto_scale_lr` is for automatically scaling LR
+# based on the actual training batch size.
+auto_scale_lr = dict(base_batch_size=4096)
+```
diff --git a/docs/en/notes/projects.md b/docs/en/notes/projects.md
new file mode 100644
index 0000000000000000000000000000000000000000..d6b625432948307e76e5ddeb5b994e575874e425
--- /dev/null
+++ b/docs/en/notes/projects.md
@@ -0,0 +1,21 @@
+# Projects based on MMPretrain
+
+There are many projects built upon MMPretrain (previously MMClassification).
+We list some of them as examples of how to extend MMPretrain for your own projects.
+As this page might not be complete, please feel free to create a PR to update it.
+
+## Projects as an extension
+
+- [OpenMixup](https://github.com/Westlake-AI/openmixup): an open-source toolbox for supervised, self-, and semi-supervised visual representation learning with mixup based on PyTorch, especially for mixup-related methods.
+- [AI Power](https://github.com/ykk648/AI_power): AI toolbox and pretrain models.
+- [OpenBioSeq](https://github.com/Westlake-AI/OpenBioSeq): an open-source supervised and self-supervised bio-sequence representation learning toolbox based on PyTorch.
+
+## Projects of papers
+
+There are also projects released with papers.
+Some of the papers are published in top-tier conferences (CVPR, ICCV, and ECCV), and the others are also highly influential.
+To make this list also a reference for the community to develop and compare new image classification algorithms, we list them following the time order of top-tier conferences.
+Methods already supported and maintained by MMPretrain (previously MMClassification) are not listed.
+
+- Involution: Inverting the Inherence of Convolution for Visual Recognition, CVPR21. [[paper]](https://arxiv.org/abs/2103.06255)[[github]](https://github.com/d-li14/involution)
+- Convolution of Convolution: Let Kernels Spatially Collaborate, CVPR22. [[paper]](https://openaccess.thecvf.com/content/CVPR2022/papers/Zhao_Convolution_of_Convolution_Let_Kernels_Spatially_Collaborate_CVPR_2022_paper.pdf)[[github]](https://github.com/Genera1Z/ConvolutionOfConvolution)
diff --git a/docs/en/stat.py b/docs/en/stat.py
new file mode 100755
index 0000000000000000000000000000000000000000..2d74823b10020af523bba787bbca7521ff797f17
--- /dev/null
+++ b/docs/en/stat.py
@@ -0,0 +1,249 @@
+#!/usr/bin/env python
+import re
+import warnings
+from collections import defaultdict
+from pathlib import Path
+
+from modelindex.load_model_index import load
+from modelindex.models.Result import Result
+from tabulate import tabulate
+
+MMPT_ROOT = Path(__file__).absolute().parents[2]
+PAPERS_ROOT = Path('papers') # Path to save generated paper pages.
+GITHUB_PREFIX = 'https://github.com/open-mmlab/mmpretrain/blob/main/'
+MODELZOO_TEMPLATE = """\
+# Model Zoo Summary
+
+In this page, we list [all algorithms](#all-supported-algorithms) we support. You can click the link to jump to the corresponding model pages.
+
+And we also list all checkpoints for different tasks we provide. You can sort or search checkpoints in the table and click the corresponding link to model pages for more details.
+
+## All supported algorithms
+
+* Number of papers: {num_papers}
+{type_msg}
+
+* Number of checkpoints: {num_ckpts}
+{paper_msg}
+
+""" # noqa: E501
+
+METRIC_ALIAS = {
+ 'Top 1 Accuracy': 'Top-1 (%)',
+ 'Top 5 Accuracy': 'Top-5 (%)',
+}
+
+model_index = load(str(MMPT_ROOT / 'model-index.yml'))
+
+
+def build_collections(model_index):
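+    """Attach each model to its collection and record the tasks it supports."""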
+ col_by_name = {}
+ for col in model_index.collections:
+ setattr(col, 'models', [])
+ col_by_name[col.name] = col
+
+ for model in model_index.models:
+ col = col_by_name[model.in_collection]
+ col.models.append(model)
+ setattr(model, 'collection', col)
+ if model.results is None:
+ setattr(model, 'tasks', [])
+ else:
+ setattr(model, 'tasks', [result.task for result in model.results])
+
+
+build_collections(model_index)
+
+
+def count_papers(collections):
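+    """Count papers and checkpoints, then write the summary to modelzoo_statistics.md."""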
+ total_num_ckpts = 0
+ type_count = defaultdict(int)
+ paper_msgs = []
+
+ for collection in collections:
+ with open(MMPT_ROOT / collection.readme) as f:
+ readme = f.read()
+ ckpts = set(x.lower().strip()
+ for x in re.findall(r'\[model\]\((https?.*)\)', readme))
+ total_num_ckpts += len(ckpts)
+ title = collection.paper['Title']
+ papertype = collection.data.get('type', 'Algorithm')
+ type_count[papertype] += 1
+
+ readme = PAPERS_ROOT / Path(
+ collection.filepath).parent.with_suffix('.md').name
+ paper_msgs.append(
+ f'\t- [{papertype}] [{title}]({readme}) ({len(ckpts)} ckpts)')
+
+ type_msg = '\n'.join(
+ [f'\t- {type_}: {count}' for type_, count in type_count.items()])
+ paper_msg = '\n'.join(paper_msgs)
+
+ modelzoo = MODELZOO_TEMPLATE.format(
+ num_papers=len(collections),
+ num_ckpts=total_num_ckpts,
+ type_msg=type_msg,
+ paper_msg=paper_msg,
+ )
+
+ with open('modelzoo_statistics.md', 'w') as f:
+ f.write(modelzoo)
+
+
+count_papers(model_index.collections)
+
+
+def generate_paper_page(collection):
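+    """Copy a collection's README into PAPERS_ROOT, rewriting relative links to GitHub links."""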
+ PAPERS_ROOT.mkdir(exist_ok=True)
+
+ # Write a copy of README
+ with open(MMPT_ROOT / collection.readme) as f:
+ readme = f.read()
+ folder = Path(collection.filepath).parent
+ copy = PAPERS_ROOT / folder.with_suffix('.md').name
+
+ def replace_link(matchobj):
+ # Replace relative link to GitHub link.
+ name = matchobj.group(1)
+ link = matchobj.group(2)
+ if not link.startswith('http'):
+ assert (folder / link).exists(), \
+ f'Link not found:\n{collection.readme}: {link}'
+ rel_link = (folder / link).absolute().relative_to(MMPT_ROOT)
+ link = GITHUB_PREFIX + str(rel_link)
+ return f'[{name}]({link})'
+
+ content = re.sub(r'\[([^\]]+)\]\(([^)]+)\)', replace_link, readme)
+ content = f'---\ngithub_page: /{collection.readme}\n---\n' + content
+
+ def make_tabs(matchobj):
+ """modify the format from emphasis black symbol to tabs."""
+ content = matchobj.group()
+ content = content.replace('', '')
+ content = content.replace('', '')
+
+        # split the content by "**{Tab-Name}**"
+ splits = re.split(r'^\*\*(.*)\*\*$', content, flags=re.M)[1:]
+ tabs_list = []
+ for title, tab_content in zip(splits[::2], splits[1::2]):
+ title = ':::{tab} ' + title + '\n'
+ tab_content = tab_content.strip() + '\n:::\n'
+ tabs_list.append(title + tab_content)
+
+ return '::::{tabs}\n' + ''.join(tabs_list) + '::::'
+
+    if '<!-- [TABS-BEGIN] -->' in content and '<!-- [TABS-END] -->' in content:
+        # Make the TABS block a selective tabs directive
+        try:
+            pattern = r'<!-- \[TABS-BEGIN\] -->([\d\D]*?)<!-- \[TABS-END\] -->'
+ content = re.sub(pattern, make_tabs, content)
+ except Exception as e:
+ warnings.warn(f'Can not parse the TABS, get an error : {e}')
+
+ with open(copy, 'w') as copy_file:
+ copy_file.write(content)
+
+
+for collection in model_index.collections:
+ generate_paper_page(collection)
+
+
+def scatter_results(models):
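+    """Flatten models into (model, result) pairs; models without results get an empty Result."""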
+ model_result_pairs = []
+ for model in models:
+ if model.results is None:
+ result = Result(task=None, dataset=None, metrics={})
+ model_result_pairs.append((model, result))
+ else:
+ for result in model.results:
+ model_result_pairs.append((model, result))
+ return model_result_pairs
+
+
+def generate_summary_table(task, model_result_pairs, title=None):
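+    """Append a metric summary table for the given task to modelzoo_statistics.md."""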
+ metrics = set()
+ for model, result in model_result_pairs:
+ if result.task == task:
+ metrics = metrics.union(result.metrics.keys())
+ metrics = sorted(list(metrics))
+
+ rows = []
+ for model, result in model_result_pairs:
+ if result.task != task:
+ continue
+ name = model.name
+ params = f'{model.metadata.parameters / 1e6:.2f}' # Params
+ if model.metadata.flops is not None:
+ flops = f'{model.metadata.flops / 1e9:.2f}' # Flops
+ else:
+ flops = None
+ readme = Path(model.collection.filepath).parent.with_suffix('.md').name
+ page = f'[link]({PAPERS_ROOT / readme})'
+ model_metrics = []
+ for metric in metrics:
+ model_metrics.append(str(result.metrics.get(metric, '')))
+
+ rows.append([name, params, flops, *model_metrics, page])
+
+ with open('modelzoo_statistics.md', 'a') as f:
+ if title is not None:
+ f.write(f'\n{title}')
+ f.write("""\n```{table}\n:class: model-summary\n""")
+ header = [
+ 'Model',
+ 'Params (M)',
+ 'Flops (G)',
+ *[METRIC_ALIAS.get(metric, metric) for metric in metrics],
+ 'Readme',
+ ]
+ table_cfg = dict(
+ tablefmt='pipe',
+ floatfmt='.2f',
+ numalign='right',
+ stralign='center')
+ f.write(tabulate(rows, header, **table_cfg))
+ f.write('\n```\n')
+
+
+def generate_dataset_wise_table(task, model_result_pairs, title=None):
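+    """Group the results of a task by dataset and generate one summary table per dataset."""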
+ dataset_rows = defaultdict(list)
+ for model, result in model_result_pairs:
+ if result.task == task:
+ dataset_rows[result.dataset].append((model, result))
+
+ if title is not None:
+ with open('modelzoo_statistics.md', 'a') as f:
+ f.write(f'\n{title}')
+ for dataset, pairs in dataset_rows.items():
+ generate_summary_table(task, pairs, title=f'### {dataset}')
+
+
+model_result_pairs = scatter_results(model_index.models)
+
+# Generate Pretrain Summary
+generate_summary_table(
+ task=None,
+ model_result_pairs=model_result_pairs,
+ title='## Pretrained Models',
+)
+
+# Generate Image Classification Summary
+generate_dataset_wise_table(
+ task='Image Classification',
+ model_result_pairs=model_result_pairs,
+ title='## Image Classification',
+)
+
+# Generate Multi-Label Classification Summary
+generate_dataset_wise_table(
+ task='Multi-Label Classification',
+ model_result_pairs=model_result_pairs,
+ title='## Multi-Label Classification',
+)
+
+# Generate Image Retrieval Summary
+generate_dataset_wise_table(
+ task='Image Retrieval',
+ model_result_pairs=model_result_pairs,
+ title='## Image Retrieval',
+)
diff --git a/docs/en/useful_tools/cam_visualization.md b/docs/en/useful_tools/cam_visualization.md
new file mode 100644
index 0000000000000000000000000000000000000000..023e37ac2397fd315df8eace8ed1fde1c9f1abb1
--- /dev/null
+++ b/docs/en/useful_tools/cam_visualization.md
@@ -0,0 +1,164 @@
+# Class Activation Map (CAM) Visualization
+
+## Introduction of the CAM visualization tool
+
+MMPretrain provides the `tools/visualization/vis_cam.py` tool to visualize class activation maps. Please use the `pip install "grad-cam>=1.3.6"` command to install [pytorch-grad-cam](https://github.com/jacobgil/pytorch-grad-cam).
+
+The supported methods are as follows:
+
+| Method | What it does |
+| ------------ | ---------------------------------------------------------------------------------------------------------------------------- |
+| GradCAM | Weight the 2D activations by the average gradient |
+| GradCAM++ | Like GradCAM but uses second order gradients |
+| XGradCAM | Like GradCAM but scale the gradients by the normalized activations |
+| EigenCAM     | Takes the first principal component of the 2D Activations (no class discrimination, but seems to give great results)          |
+| EigenGradCAM | Like EigenCAM but with class discrimination: First principal component of Activations\*Grad. Looks like GradCAM, but cleaner  |
+| LayerCAM | Spatially weight the activations by positive gradients. Works better especially in lower layers |
+
+Newer CAM methods supported by more recent versions of `pytorch-grad-cam` may also work, but we haven't verified their availability.
+
+**Command**:
+
+```bash
+python tools/visualization/vis_cam.py \
+ ${IMG} \
+ ${CONFIG_FILE} \
+ ${CHECKPOINT} \
+ [--target-layers ${TARGET-LAYERS}] \
+ [--preview-model] \
+ [--method ${METHOD}] \
+ [--target-category ${TARGET-CATEGORY}] \
+ [--save-path ${SAVE_PATH}] \
+ [--vit-like] \
+    [--num-extra-tokens ${NUM-EXTRA-TOKENS}] \
+    [--aug-smooth] \
+    [--eigen-smooth] \
+ [--device ${DEVICE}] \
+ [--cfg-options ${CFG-OPTIONS}]
+```
+
+**Description of all arguments**:
+
+- `img`: The target picture path.
+- `config`: The path of the model config file.
+- `checkpoint`: The path of the checkpoint.
+- `--target-layers`: The target layers to compute activation maps from; one or more network layers can be specified. If not set, the norm layer of the last block is used.
+- `--preview-model`: Whether to print all network layer names in the model.
+- `--method`: Visualization method, supports `GradCAM`, `GradCAM++`, `XGradCAM`, `EigenCAM`, `EigenGradCAM`, `LayerCAM`, which are case insensitive. Defaults to `GradCAM`.
+- `--target-category`: The target category. If not set, the category predicted by the given model is used.
+- `--eigen-smooth`: Whether to use the principal component to reduce noise.
+- `--aug-smooth`: Whether to use TTA (Test-Time Augmentation) to get the CAM.
+- `--save-path`: The path to save the CAM visualization image. If not set, the CAM image will not be saved.
+- `--vit-like`: Whether the network is a ViT-like network.
+- `--num-extra-tokens`: The number of extra tokens in ViT-like backbones. If not set, use the `num_extra_tokens` attribute of the backbone.
+- `--device`: The computing device to use. Defaults to 'cpu'.
+- `--cfg-options`: Modifications to the configuration file, refer to [Learn about Configs](../user_guides/config.md).
+
+```{note}
+The argument `--preview-model` can be used to list all network layer names in the given model, which is helpful if you don't know which layers to choose for `--target-layers`. See the example below.
+```
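+
+For example, to print the layer names of the ResNet-50 model used in the examples below before choosing `--target-layers`:
+
+```shell
+python tools/visualization/vis_cam.py \
+    demo/bird.JPEG \
+    configs/resnet/resnet50_8xb32_in1k.py \
+    https://download.openmmlab.com/mmclassification/v0/resnet/resnet50_batch256_imagenet_20200708-cfb998bf.pth \
+    --preview-model
+```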
+
+## How to visualize the CAM of CNN (ResNet-50)
+
+Here are some examples of `target-layers` in ResNet-50, which can be any module or layer:
+
+- `'backbone.layer4'` means the output of the fourth ResLayer.
+- `'backbone.layer4.2'` means the output of the third BottleNeck block in the fourth ResLayer.
+- `'backbone.layer4.2.conv1'` means the output of the `conv1` layer in the above BottleNeck block.
+
+1. Use different methods to visualize the CAM for `ResNet50`. Here the `target-category` is the result predicted by the given checkpoint, using the default `target-layers`.
+
+ ```shell
+ python tools/visualization/vis_cam.py \
+ demo/bird.JPEG \
+ configs/resnet/resnet50_8xb32_in1k.py \
+ https://download.openmmlab.com/mmclassification/v0/resnet/resnet50_batch256_imagenet_20200708-cfb998bf.pth \
+ --method GradCAM
+ # GradCAM++, XGradCAM, EigenCAM, EigenGradCAM, LayerCAM
+ ```
+
+ | Image | GradCAM | GradCAM++ | EigenGradCAM | LayerCAM |
+ | ------------------------------------ | --------------------------------------- | ----------------------------------------- | -------------------------------------------- | ---------------------------------------- |
+ | | | | | |
+
+2. Use different `target-category` values to get the CAM from the same picture. In the `ImageNet` dataset, category 238 is 'Greater Swiss Mountain dog' and category 281 is 'tabby, tabby cat'.
+
+ ```shell
+ python tools/visualization/vis_cam.py \
+ demo/cat-dog.png configs/resnet/resnet50_8xb32_in1k.py \
+ https://download.openmmlab.com/mmclassification/v0/resnet/resnet50_batch256_imagenet_20200708-cfb998bf.pth \
+ --target-layers 'backbone.layer4.2' \
+ --method GradCAM \
+ --target-category 238
+ # --target-category 281
+ ```
+
+ | Category | Image | GradCAM | XGradCAM | LayerCAM |
+ | -------- | ---------------------------------------------- | ------------------------------------------------ | ------------------------------------------------- | ------------------------------------------------- |
+ | Dog | | | | |
+ | Cat | | | | |
+
+3. Use `--eigen-smooth` and `--aug-smooth` to improve visual effects.
+
+ ```shell
+ python tools/visualization/vis_cam.py \
+ demo/dog.jpg \
+ configs/mobilenet_v3/mobilenet-v3-large_8xb128_in1k.py \
+ https://download.openmmlab.com/mmclassification/v0/mobilenet_v3/convert/mobilenet_v3_large-3ea3c186.pth \
+ --target-layers 'backbone.layer16' \
+ --method LayerCAM \
+ --eigen-smooth --aug-smooth
+ ```
+
+ | Image | LayerCAM | eigen-smooth | aug-smooth | eigen&aug |
+ | ------------------------------------ | --------------------------------------- | ------------------------------------------- | ----------------------------------------- | ----------------------------------------- |
+ | | | | | |
+
+## How to visualize the CAM of vision transformer
+
+Here are some examples:
+
+- `'backbone.norm3'` for Swin-Transformer;
+- `'backbone.layers.11.ln1'` for ViT;
+
+For ViT-like networks, such as ViT, T2T-ViT and Swin-Transformer, the features are flattened. To draw the CAM, we need to specify the `--vit-like` argument to reshape the features into square feature maps.
+
+Besides the flattened features, some ViT-like networks also add extra tokens, like the class token in ViT and T2T-ViT, and the distillation token in DeiT. In these networks, the final classification is done on the tokens computed in the last attention block, so the classification score is not affected by the other features and the gradient of the classification score with respect to them is zero. Therefore, you shouldn't use the output of the last attention block as the target layer in these networks.
+
+To exclude these extra tokens, we need to know their number. Almost all transformer-based backbones in MMPretrain have the `num_extra_tokens` attribute. If you want to use this tool with a new or third-party network that doesn't have the `num_extra_tokens` attribute, please specify it with the `--num-extra-tokens` argument.
+
+1. Visualize CAM for `Swin Transformer`, using default `target-layers`:
+
+ ```shell
+ python tools/visualization/vis_cam.py \
+ demo/bird.JPEG \
+ configs/swin_transformer/swin-tiny_16xb64_in1k.py \
+ https://download.openmmlab.com/mmclassification/v0/swin-transformer/swin_tiny_224_b16x64_300e_imagenet_20210616_090925-66df6be6.pth \
+ --vit-like
+ ```
+
+2. Visualize CAM for `Vision Transformer(ViT)`:
+
+ ```shell
+ python tools/visualization/vis_cam.py \
+ demo/bird.JPEG \
+ configs/vision_transformer/vit-base-p16_64xb64_in1k-384px.py \
+ https://download.openmmlab.com/mmclassification/v0/vit/finetune/vit-base-p16_in21k-pre-3rdparty_ft-64xb64_in1k-384_20210928-98e8652b.pth \
+ --vit-like \
+ --target-layers 'backbone.layers.11.ln1'
+ ```
+
+3. Visualize CAM for `T2T-ViT`:
+
+ ```shell
+ python tools/visualization/vis_cam.py \
+ demo/bird.JPEG \
+ configs/t2t_vit/t2t-vit-t-14_8xb64_in1k.py \
+ https://download.openmmlab.com/mmclassification/v0/t2t-vit/t2t-vit-t-14_3rdparty_8xb64_in1k_20210928-b7c09b62.pth \
+ --vit-like \
+ --target-layers 'backbone.encoder.12.ln1'
+ ```
+
+| Image | ResNet50 | ViT | Swin | T2T-ViT |
+| --------------------------------------- | ------------------------------------------ | -------------------------------------- | --------------------------------------- | ------------------------------------------ |
+| | | | | |
diff --git a/docs/en/useful_tools/complexity_analysis.md b/docs/en/useful_tools/complexity_analysis.md
new file mode 100644
index 0000000000000000000000000000000000000000..ac6d1334c6d18c448d5f89144b421717259d7b19
--- /dev/null
+++ b/docs/en/useful_tools/complexity_analysis.md
@@ -0,0 +1,77 @@
+# Model Complexity Analysis
+
+## Get the FLOPs and params (experimental)
+
+We provide a script adapted from [MMEngine](https://github.com/open-mmlab/mmengine/blob/main/mmengine/analysis/complexity_analysis.py) to compute the FLOPs and params of a given model.
+
+```shell
+python tools/analysis_tools/get_flops.py ${CONFIG_FILE} [--shape ${INPUT_SHAPE}]
+```
+
+Description of all arguments:
+
+- `config`: The path of the model config file.
+- `--shape`: The input size, which supports either a single value or two values, such as `--shape 256` or `--shape 224 256`. If not set, it defaults to `224 224`.
+
+Example:
+
+```shell
+python tools/analysis_tools/get_flops.py configs/resnet/resnet50_8xb32_in1k.py
+```
+
+You will get the final result like this.
+
+```text
+==============================
+Input shape: (3, 224, 224)
+Flops: 4.109G
+Params: 25.557M
+Activation: 11.114M
+==============================
+```
+
+Also, you will get the detailed complexity information of each layer like this:
+
+```text
++--------------------------+----------------------+-----------+--------------+
+| module | #parameters or shape | #flops | #activations |
++--------------------------+----------------------+-----------+--------------+
+| model | 25.557M | 4.109G | 11.114M |
+| backbone | 23.508M | 4.109G | 11.114M |
+| backbone.conv1 | 9.408K | 0.118G | 0.803M |
+| backbone.conv1.weight | (64, 3, 7, 7) | | |
+| backbone.bn1 | 0.128K | 1.606M | 0 |
+| backbone.bn1.weight | (64,) | | |
+| backbone.bn1.bias | (64,) | | |
+| backbone.layer1 | 0.216M | 0.677G | 4.415M |
+| backbone.layer1.0 | 75.008K | 0.235G | 2.007M |
+| backbone.layer1.1 | 70.4K | 0.221G | 1.204M |
+| backbone.layer1.2 | 70.4K | 0.221G | 1.204M |
+| backbone.layer2 | 1.22M | 1.034G | 3.111M |
+| backbone.layer2.0 | 0.379M | 0.375G | 1.305M |
+| backbone.layer2.1 | 0.28M | 0.22G | 0.602M |
+| backbone.layer2.2 | 0.28M | 0.22G | 0.602M |
+| backbone.layer2.3 | 0.28M | 0.22G | 0.602M |
+| backbone.layer3 | 7.098M | 1.469G | 2.158M |
+| backbone.layer3.0 | 1.512M | 0.374G | 0.652M |
+| backbone.layer3.1 | 1.117M | 0.219G | 0.301M |
+| backbone.layer3.2 | 1.117M | 0.219G | 0.301M |
+| backbone.layer3.3 | 1.117M | 0.219G | 0.301M |
+| backbone.layer3.4 | 1.117M | 0.219G | 0.301M |
+| backbone.layer3.5 | 1.117M | 0.219G | 0.301M |
+| backbone.layer4 | 14.965M | 0.81G | 0.627M |
+| backbone.layer4.0 | 6.04M | 0.373G | 0.326M |
+| backbone.layer4.1 | 4.463M | 0.219G | 0.151M |
+| backbone.layer4.2 | 4.463M | 0.219G | 0.151M |
+| head.fc | 2.049M | | |
+| head.fc.weight | (1000, 2048) | | |
+| head.fc.bias | (1000,) | | |
+| neck.gap | | 0.1M | 0 |
++--------------------------+----------------------+-----------+--------------+
+```
+
+```{warning}
+This tool is still experimental and we do not guarantee that the number is correct. You may well use the result for simple comparisons, but double-check it before you adopt it in technical reports or papers.
+- FLOPs are related to the input shape while parameters are not. The default input shape is (1, 3, 224, 224).
+- Some operators are not counted into FLOPs like custom operators. Refer to [`mmengine.analysis.complexity_analysis._DEFAULT_SUPPORTED_FLOP_OPS`](https://github.com/open-mmlab/mmengine/blob/main/mmengine/analysis/complexity_analysis.py) for details.
+```
diff --git a/docs/en/useful_tools/confusion_matrix.md b/docs/en/useful_tools/confusion_matrix.md
new file mode 100644
index 0000000000000000000000000000000000000000..306b585c0d39007adf6db5899105574e7c597f17
--- /dev/null
+++ b/docs/en/useful_tools/confusion_matrix.md
@@ -0,0 +1,84 @@
+# Confusion Matrix
+
+MMPretrain provides the `tools/analysis_tools/confusion_matrix.py` tool to calculate and visualize the confusion matrix. For an introduction to the confusion matrix, see [link](https://en.wikipedia.org/wiki/Confusion_matrix).
+
+## Command-line Usage
+
+**Command**:
+
+```shell
+python tools/analysis_tools/confusion_matrix.py \
+ ${CONFIG_FILE} \
+ ${CHECKPOINT} \
+ [--show] \
+    [--show-path ${SHOW_PATH}] \
+ [--include-values] \
+ [--cmap ${CMAP}] \
+ [--cfg-options ${CFG-OPTIONS}]
+```
+
+**Description of all arguments**:
+
+- `config`: The path of the model config file.
+- `checkpoint`: The path of the checkpoint.
+- `--show`: Whether to show the matplotlib visualization result of the confusion matrix. Defaults to `False`.
+- `--show-path`: The path to save the visualization result when `--show` is set.
+- `--include-values`: Whether to add values to the visualization results.
+- `--cmap`: The color map used for the visualization result. Defaults to `viridis`.
+- `--cfg-options`: Modifications to the configuration file, refer to [Learn about Configs](../user_guides/config.md).
+
+**Examples of use**:
+
+```shell
+python tools/analysis_tools/confusion_matrix.py \
+ configs/resnet/resnet50_8xb16_cifar10.py \
+ https://download.openmmlab.com/mmclassification/v0/resnet/resnet50_b16x8_cifar10_20210528-f54bfad9.pth \
+ --show
+```
+
+**output image**:
+
+
+
+## **Basic Usage**
+
+```python
+>>> import torch
+>>> from mmpretrain.evaluation import ConfusionMatrix
+>>> y_pred = [0, 1, 1, 3]
+>>> y_true = [0, 2, 1, 3]
+>>> ConfusionMatrix.calculate(y_pred, y_true, num_classes=4)
+tensor([[1, 0, 0, 0],
+ [0, 1, 0, 0],
+ [0, 1, 0, 0],
+ [0, 0, 0, 1]])
+>>> # plot the confusion matrix
+>>> import matplotlib.pyplot as plt
+>>> y_score = torch.rand((1000, 10))
+>>> y_true = torch.randint(10, (1000, ))
+>>> matrix = ConfusionMatrix.calculate(y_score, y_true)
+>>> ConfusionMatrix().plot(matrix)
+>>> plt.show()
+```
+
+## **Use with Evaluator**
+
+```python
+>>> import torch
+>>> from mmpretrain.evaluation import ConfusionMatrix
+>>> from mmpretrain.structures import DataSample
+>>> from mmengine.evaluator import Evaluator
+>>> data_samples = [
+... DataSample().set_gt_label(i%5).set_pred_score(torch.rand(5))
+... for i in range(1000)
+... ]
+>>> evaluator = Evaluator(metrics=ConfusionMatrix())
+>>> evaluator.process(data_samples)
+>>> evaluator.evaluate(1000)
+{'confusion_matrix/result': tensor([[37, 37, 48, 43, 35],
+ [35, 51, 32, 46, 36],
+ [45, 28, 39, 42, 46],
+ [42, 40, 40, 35, 43],
+ [40, 39, 41, 37, 43]])}
+```
diff --git a/docs/en/useful_tools/dataset_visualization.md b/docs/en/useful_tools/dataset_visualization.md
new file mode 100644
index 0000000000000000000000000000000000000000..b1f216ce68a38d9f3d6b59e5c48a61fa0f0375fe
--- /dev/null
+++ b/docs/en/useful_tools/dataset_visualization.md
@@ -0,0 +1,90 @@
+# Dataset Visualization
+
+## Introduce the dataset visualization tool
+
+```bash
+python tools/visualization/browse_dataset.py \
+ ${CONFIG_FILE} \
+ [-o, --output-dir ${OUTPUT_DIR}] \
+ [-p, --phase ${DATASET_PHASE}] \
+ [-n, --show-number ${NUMBER_IMAGES_DISPLAY}] \
+    [-i, --show-interval ${SHOW_INTERVAL}] \
+ [-m, --mode ${DISPLAY_MODE}] \
+ [-r, --rescale-factor ${RESCALE_FACTOR}] \
+ [-c, --channel-order ${CHANNEL_ORDER}] \
+ [--cfg-options ${CFG_OPTIONS}]
+```
+
+**Description of all arguments**:
+
+- `config` : The path of a model config file.
+- `-o, --output-dir`: The output path for visualized images. If not specified, it will be set to `''`, which means not to save.
+- **`-p, --phase`**: Phase of the dataset to visualize, must be one of `['train', 'val', 'test']`. If not specified, it will be set to `'train'`.
+- **`-n, --show-number`**: The number of samples to visualize. If not specified, display all images in the dataset.
+- `-i, --show-interval`: The display interval, in seconds.
+- **`-m, --mode`**: The display mode, can be one of `['original', 'transformed', 'concat', 'pipeline']`. If not specified, it will be set to `'transformed'`.
+- `-r, --rescale-factor`: The image rescale factor, which is useful if the output is too large or too small
+  in the `original` mode.
+- `-c, --channel-order`: The channel order of the displayed images, either "BGR" or "RGB". If not specified, it will be set to 'BGR'.
+- `--cfg-options` : Modifications to the configuration file, refer to [Learn about Configs](../user_guides/config.md).
+
+```{note}
+1. The `-m, --mode` option selects the display mode: show the original pictures, the transformed pictures, or comparison pictures.
+- "original" means to show the images loaded from disk;
+- "transformed" means to show the images after being transformed;
+- "concat" means to show images stitched from the "original" and "transformed" images;
+- "pipeline" means to show all the intermediate images throughout the pipeline.
+
+2. The `-r, --rescale-factor` option is set when the label information is too large or too small relative to the picture. For example, when visualizing the CIFAR dataset, since the resolution of the image is very small, `--rescale-factor` can be set to 10.
+```
+
+## How to visualize the original image
+
+In **'original'** mode:
+
+```shell
+python ./tools/visualization/browse_dataset.py ./configs/resnet/resnet101_8xb16_cifar10.py --phase val --output-dir tmp --mode original --show-number 100 --rescale-factor 10 --channel-order RGB
+```
+
+- `--phase val`: Visualize the validation set, can be simplified to `-p val`;
+- `--output-dir tmp`: Save the visualization results in the "tmp" folder, can be simplified to `-o tmp`;
+- `--mode original`: Visualize the original images, can be simplified to `-m original`;
+- `--show-number 100`: Visualize 100 images, can be simplified to `-n 100`;
+- `--rescale-factor 10`: Enlarge the images by 10 times, can be simplified to `-r 10`;
+- `--channel-order RGB`: Set the channel order of the visualized images to "RGB", can be simplified to `-c RGB`.
+
+
+
+## How to visualize the transformed images
+
+In **'transformed'** mode:
+
+```shell
+python ./tools/visualization/browse_dataset.py ./configs/resnet/resnet50_8xb32_in1k.py -n 100
+```
+
+
+
+## How to visualize the transformed images and original images together
+
+In **'concat'** mode:
+
+```shell
+python ./tools/visualization/browse_dataset.py configs/swin_transformer/swin-small_16xb64_in1k.py -n 10 -m concat
+```
+
+
+
+## How to visualize the intermediate images in the pipeline
+
+In **'pipeline'** mode:
+
+```shell
+python ./tools/visualization/browse_dataset.py configs/swin_transformer/swin-small_16xb64_in1k.py -m pipeline
+```
+
+
+
+```shell
+python ./tools/visualization/browse_dataset.py configs/beit/beit_beit-base-p16_8xb256-amp-coslr-300e_in1k.py -m pipeline
+```
+
+
diff --git a/docs/en/useful_tools/log_result_analysis.md b/docs/en/useful_tools/log_result_analysis.md
new file mode 100644
index 0000000000000000000000000000000000000000..99968d7a05937929f021c712808e8fe0ef2db3ff
--- /dev/null
+++ b/docs/en/useful_tools/log_result_analysis.md
@@ -0,0 +1,226 @@
+# Log and Results Analysis
+
+## Log Analysis
+
+### Introduction of log analysis tool
+
+`tools/analysis_tools/analyze_logs.py` plots curves of given keys according to the log files.
+
+
+
+```shell
+python tools/analysis_tools/analyze_logs.py plot_curve \
+ ${JSON_LOGS} \
+ [--keys ${KEYS}] \
+ [--title ${TITLE}] \
+ [--legend ${LEGEND}] \
+ [--backend ${BACKEND}] \
+ [--style ${STYLE}] \
+ [--out ${OUT_FILE}] \
+ [--window-size ${WINDOW_SIZE}]
+```
+
+**Description of all arguments**:
+
+- `json_logs` : The paths of the log files, separate multiple files by spaces.
+- `--keys` : The fields of the logs to analyze, separate multiple keys by spaces. Defaults to 'loss'.
+- `--title` : The title of the figure. Defaults to use the filename.
+- `--legend` : The names of the legend entries, the number of which must be equal to `len(${JSON_LOGS}) * len(${KEYS})`. Defaults to `"${JSON_LOG}-${KEYS}"`.
+- `--backend` : The backend of matplotlib. Defaults to auto selected by matplotlib.
+- `--style` : The style of the figure. Defaults to `whitegrid`.
+- `--out` : The path of the output picture. If not set, the figure won't be saved.
+- `--window-size`: The shape of the display window. The format should be `'W*H'`. Defaults to `'12*7'`.
+
+```{note}
+The `--style` option depends on the `seaborn` package; please install it before setting the style (a sketch install command is shown below).
+```
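+
+For instance, assuming you use pip to manage packages:
+
+```shell
+pip install seaborn
+```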
+
+### How to plot the loss/accuracy curve
+
+We present some examples here to show how to plot the loss curve or accuracy curve by using `tools/analysis_tools/analyze_logs.py`.
+
+#### Plot the loss curve in training.
+
+```shell
+python tools/analysis_tools/analyze_logs.py plot_curve your_log_json --keys loss --legend loss
+```
+
+#### Plot the top-1 accuracy and top-5 accuracy curves, and save the figure to results.jpg.
+
+```shell
+python tools/analysis_tools/analyze_logs.py plot_curve your_log_json --keys accuracy/top1 accuracy/top5 --legend top1 top5 --out results.jpg
+```
+
+#### Compare the top-1 accuracy of two log files in the same figure.
+
+```shell
+python tools/analysis_tools/analyze_logs.py plot_curve log1.json log2.json --keys accuracy/top1 --legend exp1 exp2
+```
+
+### How to calculate training time
+
+`tools/analysis_tools/analyze_logs.py` can also calculate the training time according to the log files.
+
+```shell
+python tools/analysis_tools/analyze_logs.py cal_train_time \
+ ${JSON_LOGS}
+ [--include-outliers]
+```
+
+**Description of all arguments**:
+
+- `json_logs` : The paths of the log files, separate multiple files by spaces.
+- `--include-outliers` : If set, include the first time record in each epoch (the time of the first iteration is sometimes longer).
+
+Example:
+
+```shell
+python tools/analysis_tools/analyze_logs.py cal_train_time work_dirs/your_exp/20230206_181002/vis_data/scalars.json
+```
+
+The output is expected to be like the below.
+
+```text
+-----Analyze train time of work_dirs/your_exp/20230206_181002/vis_data/scalars.json-----
+slowest epoch 68, average time is 0.3818
+fastest epoch 1, average time is 0.3694
+time std over epochs is 0.0020
+average iter time: 0.3777 s/iter
+```
+
+## Result Analysis
+
+With the `--out` argument in `tools/test.py`, we can save the inference results of all samples as a file.
+And with this result file, we can do further analysis.
+
+### How to conduct offline metric evaluation
+
+We provide `tools/analysis_tools/eval_metric.py` to enable users to evaluate the model from the prediction files.
+
+```shell
+python tools/analysis_tools/eval_metric.py \
+ ${RESULT} \
+ [--metric ${METRIC_OPTIONS} ...]
+```
+
+Description of all arguments:
+
+- `result`: The output result file in pickle format from `tools/test.py`.
+- `--metric`: The metric and options to evaluate the results. You need to specify at least one metric and you
+ can also specify multiple `--metric` to use multiple metrics.
+
+Please refer to the [Metric Documentation](mmpretrain.evaluation) to find the available metrics and options.
+
+```{note}
+In `tools/test.py`, we support using `--out-item` option to select which kind of results will be saved.
+Please ensure the `--out-item` is not specified or `--out-item=pred` to use this tool.
+```
+
+**Examples**:
+
+```shell
+# Get the prediction results
+python tools/test.py configs/resnet/resnet18_8xb16_cifar10.py \
+ https://download.openmmlab.com/mmclassification/v0/resnet/resnet18_b16x8_cifar10_20210528-bd6371c8.pth \
+ --out results.pkl
+
+# Eval the top-1 and top-5 accuracy
+python tools/analysis_tools/eval_metric.py results.pkl --metric type=Accuracy topk=1,5
+
+# Eval the overall accuracy and the class-wise precision, recall, f1-score
+python tools/analysis_tools/eval_metric.py results.pkl --metric type=Accuracy \
+ --metric type=SingleLabelMetric items=precision,recall,f1-score average=None
+```
+
+### How to plot the confusion matrix for the test result
+
+We provide `tools/analysis_tools/confusion_matrix.py` to enable users to plot the confusion matrix from the prediction files.
+
+```shell
+python tools/analysis_tools/confusion_matrix.py \
+ ${CONFIG} \
+ ${RESULT} \
+ [--out ${OUT}] \
+ [--show] \
+ [--show-path ${SHOW_PATH}] \
+ [--include-values] \
+    [--cmap ${CMAP}] \
+    [--cfg-options ${CFG_OPTIONS} ...]
+```
+
+Description of all arguments:
+
+- `config`: The config file path.
+- `result`: The output result file in pickle format from `tools/test.py`, or a checkpoint file.
+- `--out`: The path to save the confusion matrix in pickle format.
+- `--show`: Whether to show the confusion matrix plot.
+- `--show-path`: The path to save the confusion matrix plot.
+- `--include-values`: Whether to show the values in the confusion matrix plot.
+- `--cmap`: The color map to plot the confusion matrix.
+- `--cfg-options`: If specified, the key-value pair config will be merged into the config file, for more details please refer to [Learn about Configs](../user_guides/config.md)
+
+```{note}
+In `tools/test.py`, we support using `--out-item` option to select which kind of results will be saved.
+Please ensure the `--out-item` is not specified or `--out-item=pred` to use this tool.
+```
+
+**Examples**:
+
+```shell
+# Get the prediction results
+python tools/test.py configs/resnet/resnet18_8xb16_cifar10.py \
+ https://download.openmmlab.com/mmclassification/v0/resnet/resnet18_b16x8_cifar10_20210528-bd6371c8.pth \
+ --out results.pkl
+
+# Save the confusion matrix in a pickle file
+python tools/analysis_tools/confusion_matrix.py configs/resnet/resnet18_8xb16_cifar10.py results.pkl --out cm.pkl
+
+# Show the confusion matrix plot in a graphical window.
+python tools/analysis_tools/confusion_matrix.py configs/resnet/resnet18_8xb16_cifar10.py results.pkl --show
+```
+
+### How to visualize the prediction results
+
+We can use `tools/analysis_tools/analyze_results.py` to save the images with the highest scores in successful or failed predictions.
+
+```shell
+python tools/analysis_tools/analyze_results.py \
+ ${CONFIG} \
+ ${RESULT} \
+ [--out-dir ${OUT_DIR}] \
+ [--topk ${TOPK}] \
+ [--rescale-factor ${RESCALE_FACTOR}] \
+ [--cfg-options ${CFG_OPTIONS}]
+```
+
+**Description of all arguments**:
+
+- `config` : The path of the model config file.
+- `result`: Output result file in json/pickle format from `tools/test.py`.
+- `--out-dir`: The directory to store the output files.
+- `--topk`: The number of images in successful or failed predictions with the highest `topk` scores to save. If not specified, it will be set to 20.
+- `--rescale-factor`: The image rescale factor, which is useful if the output is too large or too small (too small
+  images may make the prediction tags hard to read).
+- `--cfg-options`: If specified, the key-value pair config will be merged into the config file, for more details please refer to [Learn about Configs](../user_guides/config.md)
+
+```{note}
+In `tools/test.py`, we support using `--out-item` option to select which kind of results will be saved.
+Please ensure the `--out-item` is not specified or `--out-item=pred` to use this tool.
+```
+
+**Examples**:
+
+```shell
+# Get the prediction results
+python tools/test.py configs/resnet/resnet18_8xb16_cifar10.py \
+ https://download.openmmlab.com/mmclassification/v0/resnet/resnet18_b16x8_cifar10_20210528-bd6371c8.pth \
+ --out results.pkl
+
+# Save the top-10 successful and failed predictions. And enlarge the sample images by 10 times.
+python tools/analysis_tools/analyze_results.py \
+ configs/resnet/resnet18_8xb16_cifar10.py \
+ results.pkl \
+ --out-dir output \
+ --topk 10 \
+ --rescale-factor 10
+```
diff --git a/docs/en/useful_tools/model_serving.md b/docs/en/useful_tools/model_serving.md
new file mode 100644
index 0000000000000000000000000000000000000000..9f135fbf5c95ba35fc2b794afdaf9b0f0f0c2ec6
--- /dev/null
+++ b/docs/en/useful_tools/model_serving.md
@@ -0,0 +1,88 @@
+# Torchserve Deployment
+
+In order to serve an `MMPretrain` model with [`TorchServe`](https://pytorch.org/serve/), you can follow the steps:
+
+## 1. Convert model from MMPretrain to TorchServe
+
+```shell
+python tools/torchserve/mmpretrain2torchserve.py ${CONFIG_FILE} ${CHECKPOINT_FILE} \
+--output-folder ${MODEL_STORE} \
+--model-name ${MODEL_NAME}
+```
+
+```{note}
+${MODEL_STORE} needs to be an absolute path to a folder.
+```
+
+Example:
+
+```shell
+python tools/torchserve/mmpretrain2torchserve.py \
+ configs/resnet/resnet18_8xb32_in1k.py \
+ checkpoints/resnet18_8xb32_in1k_20210831-fbbb1da6.pth \
+ --output-folder ./checkpoints \
+ --model-name resnet18_in1k
+```
+
+## 2. Build `mmpretrain-serve` docker image
+
+```shell
+docker build -t mmpretrain-serve:latest docker/serve/
+```
+
+## 3. Run `mmpretrain-serve`
+
+Check the official docs for [running TorchServe with docker](https://github.com/pytorch/serve/blob/master/docker/README.md#running-torchserve-in-a-production-docker-environment).
+
+In order to run on a GPU, you need to install [nvidia-docker](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html). You can omit the `--gpus` argument to run on the CPU instead.
+
+Example:
+
+```shell
+docker run --rm \
+--name mar \
+--cpus 8 \
+--gpus device=0 \
+-p8080:8080 -p8081:8081 -p8082:8082 \
+--mount type=bind,source=`realpath ./checkpoints`,target=/home/model-server/model-store \
+mmpretrain-serve:latest
+```
+
+```{note}
+`realpath ./checkpoints` points to the absolute path of "./checkpoints", and you can replace it with the absolute path where you store torchserve models.
+```
+
+[Read the docs](https://github.com/pytorch/serve/blob/master/docs/rest_api.md) about the Inference (8080), Management (8081) and Metrics (8082) APIs.
+
+## 4. Test deployment
+
+```shell
+curl http://127.0.0.1:8080/predictions/${MODEL_NAME} -T demo/demo.JPEG
+```
+
+You should obtain a response similar to:
+
+```json
+{
+ "pred_label": 58,
+ "pred_score": 0.38102269172668457,
+ "pred_class": "water snake"
+}
+```
+
+You can use `test_torchserver.py` to compare the results of TorchServe and PyTorch, and visualize them.
+
+```shell
+python tools/torchserve/test_torchserver.py ${IMAGE_FILE} ${CONFIG_FILE} ${CHECKPOINT_FILE} ${MODEL_NAME}
+[--inference-addr ${INFERENCE_ADDR}] [--device ${DEVICE}]
+```
+
+Example:
+
+```shell
+python tools/torchserve/test_torchserver.py \
+ demo/demo.JPEG \
+ configs/resnet/resnet18_8xb32_in1k.py \
+ checkpoints/resnet18_8xb32_in1k_20210831-fbbb1da6.pth \
+ resnet18_in1k
+```
diff --git a/docs/en/useful_tools/print_config.md b/docs/en/useful_tools/print_config.md
new file mode 100644
index 0000000000000000000000000000000000000000..ea4076475b4fdf1ee6f158e49b115abeabf2336c
--- /dev/null
+++ b/docs/en/useful_tools/print_config.md
@@ -0,0 +1,27 @@
+# How to Get the Complete Config
+
+We also provide the `print_config.py` tool to print the complete configuration of the given experiment.
+You can check each item of the config before training by using the following commands.
+
+## Description
+
+`tools/misc/print_config.py` prints the whole config verbatim, expanding all its imports.
+
+```shell
+python tools/misc/print_config.py ${CONFIG} [--cfg-options ${CFG_OPTIONS}]
+```
+
+Description of all arguments:
+
+- `config` : The path of the model config file.
+- `--cfg-options`: If specified, the key-value pair config will be merged into the config file, for more details please refer to [Learn about Configs](../user_guides/config.md)
+
+## Examples
+
+```shell
+# Print a complete config
+python tools/misc/print_config.py configs/t2t_vit/t2t-vit-t-14_8xb64_in1k.py
+
+# Save the complete config to an independent config file.
+python tools/misc/print_config.py configs/t2t_vit/t2t-vit-t-14_8xb64_in1k.py > final_config.py
+```
diff --git a/docs/en/useful_tools/scheduler_visualization.md b/docs/en/useful_tools/scheduler_visualization.md
new file mode 100644
index 0000000000000000000000000000000000000000..0ba1bdc4ff96d678985522b17dd539dc0964f1a9
--- /dev/null
+++ b/docs/en/useful_tools/scheduler_visualization.md
@@ -0,0 +1,44 @@
+# Hyper-parameter Scheduler Visualization
+
+This tool aims to help the user check the hyper-parameter scheduler of the optimizer (without training), which supports the "learning rate" and "momentum" parameters.
+
+## Introduce the scheduler visualization tool
+
+```bash
+python tools/visualization/vis_scheduler.py \
+ ${CONFIG_FILE} \
+ [-p, --parameter ${PARAMETER_NAME}] \
+ [-d, --dataset-size ${DATASET_SIZE}] \
+ [-n, --ngpus ${NUM_GPUs}] \
+ [-s, --save-path ${SAVE_PATH}] \
+ [--title ${TITLE}] \
+ [--style ${STYLE}] \
+ [--window-size ${WINDOW_SIZE}] \
+ [--cfg-options]
+```
+
+**Description of all arguments**:
+
+- `config`: The path of a model config file.
+- **`-p, --parameter`**: The parameter whose change curve is visualized, chosen from "lr" and "momentum". Defaults to "lr".
+- **`-d, --dataset-size`**: The size of the dataset. If set, `build_dataset` will be skipped and `${DATASET_SIZE}` will be used as the size. Defaults to using the function `build_dataset`.
+- **`-n, --ngpus`**: The number of GPUs used in training. Defaults to 1.
+- **`-s, --save-path`**: The path to save the learning rate curve plot. By default, the plot is not saved.
+- `--title`: The title of the figure. If not set, defaults to the config file name.
+- `--style`: The style of the plot. If not set, defaults to `whitegrid`.
+- `--window-size`: The shape of the display window. If not specified, it will be set to `12*7`. If used, it must be in the format `'W*H'`.
+- `--cfg-options`: Modifications to the configuration file, refer to [Learn about Configs](../user_guides/config.md).
+
+```{note}
+Loading annotations may consume much time; you can directly specify the size of the dataset with `-d, --dataset-size` to save time.
+```
+
+## How to plot the learning rate curve without training
+
+You can use the following command to plot the step learning rate schedule used in the config `configs/swin_transformer/swin-base_16xb64_in1k.py`:
+
+```bash
+python tools/visualization/vis_scheduler.py configs/swin_transformer/swin-base_16xb64_in1k.py --dataset-size 1281167 --ngpus 16
+```
+
+
diff --git a/docs/en/useful_tools/shape_bias.md b/docs/en/useful_tools/shape_bias.md
new file mode 100644
index 0000000000000000000000000000000000000000..907bde61ee7f1d86e839b2b32c694c3270a2298a
--- /dev/null
+++ b/docs/en/useful_tools/shape_bias.md
@@ -0,0 +1,100 @@
+# Shape Bias Tool Usage
+
+Shape bias measures how much a model relies on shapes, compared to textures, to perceive the semantics in images. For more details,
+we refer interested readers to this [paper](https://arxiv.org/abs/2106.07411). MMPretrain provides an off-the-shelf toolbox to
+obtain the shape bias of a classification model. You can follow the steps below:
+
+## Prepare the dataset
+
+First you should download the [cue-conflict](https://github.com/bethgelab/model-vs-human/releases/download/v0.1/cue-conflict.tar.gz) dataset to the `data` folder,
+and then unzip it. After that, your `data` folder should have the following structure:
+
+```text
+data
+├──cue-conflict
+| |──airplane
+| |──bear
+| ...
+| |── truck
+```
+
+## Modify the config for classification
+
+We run the shape-bias tool on a ViT-base model with masked autoencoder pretraining. Its config file is `configs/mae/benchmarks/vit-base-p16_8xb128-coslr-100e_in1k.py`, and its checkpoint is downloaded from [this link](https://download.openmmlab.com/mmselfsup/1.x/mae/mae_vit-base-p16_8xb512-fp16-coslr-1600e_in1k/vit-base-p16_ft-8xb128-coslr-100e_in1k/vit-base-p16_ft-8xb128-coslr-100e_in1k_20220825-cf70aa21.pth). Replace the original `test_pipeline`, `test_dataloader` and `test_evaluator` with the following configurations:
+
+```python
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='ResizeEdge',
+ scale=256,
+ edge='short',
+ backend='pillow'),
+ dict(type='CenterCrop', crop_size=224),
+ dict(type='PackInputs')
+]
+test_dataloader = dict(
+ pin_memory=True,
+ collate_fn=dict(type='default_collate'),
+ batch_size=32,
+ num_workers=4,
+ dataset=dict(
+ type='CustomDataset',
+ data_root='data/cue-conflict',
+ pipeline=test_pipeline,
+ _delete_=True),
+ sampler=dict(type='DefaultSampler', shuffle=False),
+ persistent_workers=True,
+ drop_last=False)
+test_evaluator = dict(
+ type='mmpretrain.ShapeBiasMetric',
+ _delete_=True,
+ csv_dir='work_dirs/shape_bias',
+ model_name='mae')
+```
+
+Please note that you should customize the `csv_dir` and `model_name` above. In this example, the modified config file is saved as `vit-base-p16_8xb128-coslr-100e_in1k_shape-bias.py` in the folder `configs/mae/benchmarks/`.
+
+## Inference your model with the modified config file
+
+Then you should run inference on the `cue-conflict` dataset with your modified config file.
+
+```shell
+# For PyTorch
+bash tools/dist_test.sh $CONFIG $CHECKPOINT $GPUS
+```
+
+**Description of all arguments**:
+
+- `$CONFIG`: The path of your modified config file.
+- `$CHECKPOINT`: The path or link of the checkpoint file.
+- `$GPUS`: The number of GPUs to use for the test.
+
+```shell
+# Example
+bash tools/dist_test.sh configs/mae/benchmarks/vit-base-p16_8xb128-coslr-100e_in1k_shape-bias.py https://download.openmmlab.com/mmselfsup/1.x/mae/mae_vit-base-p16_8xb512-fp16-coslr-1600e_in1k/vit-base-p16_ft-8xb128-coslr-100e_in1k/vit-base-p16_ft-8xb128-coslr-100e_in1k_20220825-cf70aa21.pth 1
+```
+
+After that, you should obtain a csv file in the `csv_dir` folder, named `cue-conflict_model-name_session-1.csv`. Besides this file, you should also download these [csv files](https://github.com/bethgelab/model-vs-human/tree/master/raw-data/cue-conflict) to the
+`csv_dir`.
+
+## Plot shape bias
+
+Then we can start to plot the shape bias:
+
+```shell
+python tools/analysis_tools/shape_bias.py --csv-dir $CSV_DIR --result-dir $RESULT_DIR --colors $RGB --markers o --plotting-names $YOUR_MODEL_NAME --model-names $YOUR_MODEL_NAME
+```
+
+**Description of all arguments**:
+
+- `--csv-dir $CSV_DIR`: the same directory where these csv files are saved.
+- `--result-dir $RESULT_DIR`: the directory to output the result named `cue-conflict_shape-bias_matrixplot.pdf`.
+- `--colors $RGB`: the RGB values, formatted as `R G B`, e.g. `100 100 100`; multiple RGB values can be given if you want to plot the shape bias of several models.
+- `--plotting-names $YOUR_MODEL_NAME`: the name of the legend in the shape bias figure, which you can set to your model name. If you want to plot several models, `--plotting-names` can take multiple values.
+- `--model-names $YOUR_MODEL_NAME`: should be the same name as specified in your config, and can be multiple names if you want to plot the shape bias of several models.
+
+Please note that every three values for `--colors` correspond to one value for `--model-names`. After all of the above steps, you are expected to obtain the following figure.
+
+
diff --git a/docs/en/useful_tools/t-sne_visualization.md b/docs/en/useful_tools/t-sne_visualization.md
new file mode 100644
index 0000000000000000000000000000000000000000..9f24a114dbe6e70a1d3b7beae6f2c98967008113
--- /dev/null
+++ b/docs/en/useful_tools/t-sne_visualization.md
@@ -0,0 +1,85 @@
+# t-Distributed Stochastic Neighbor Embedding (t-SNE) Visualization
+
+## Introduction of the t-SNE visualization tool
+
+MMPretrain provides the `tools/visualization/vis_tsne.py` tool to visualize the feature embeddings of images by t-SNE. Please install `scikit-learn` (`pip install scikit-learn`) to calculate t-SNE.
+
+**Command**:
+
+```bash
+python tools/visualization/vis_tsne.py \
+ CONFIG \
+ [--checkpoint CHECKPOINT] \
+ [--work-dir WORK_DIR] \
+ [--test-cfg TEST_CFG] \
+    [--vis-stage {backbone,neck,pre_logits}] \
+    [--class-idx ${CLASS_IDX} [CLASS_IDX ...]] \
+    [--max-num-class MAX_NUM_CLASS] \
+    [--max-num-samples MAX_NUM_SAMPLES] \
+    [--cfg-options CFG_OPTIONS [CFG_OPTIONS ...]] \
+    [--device DEVICE] \
+    [--legend] \
+    [--show] \
+    [--n-components N_COMPONENTS] \
+    [--perplexity PERPLEXITY] \
+    [--early-exaggeration EARLY_EXAGGERATION] \
+    [--learning-rate LEARNING_RATE] \
+    [--n-iter N_ITER] \
+    [--n-iter-without-progress N_ITER_WITHOUT_PROGRESS] \
+    [--init INIT]
+```
+
+**Description of all arguments**:
+
+- `CONFIG`: The path of t-SNE config file.
+- `--checkpoint CHECKPOINT`: The path of the checkpoint file.
+- `--work-dir WORK_DIR`: The directory to save logs and visualization images.
+- `--test-cfg TEST_CFG`: The path of t-SNE config file to load config of test dataloader.
+- `--vis-stage {backbone,neck,pre_logits}`: The visualization stage of the model.
+- `--class-idx CLASS_IDX [CLASS_IDX ...]`: The categories used to calculate t-SNE.
+- `--max-num-class MAX_NUM_CLASS`: The first N categories to apply t-SNE algorithms. Defaults to 20.
+- `--max-num-samples MAX_NUM_SAMPLES`: The maximum number of samples per category. A higher number needs a longer time to calculate. Defaults to 100.
+- `--cfg-options CFG_OPTIONS [CFG_OPTIONS ...]`: Override some settings in the used config, the key-value pair in xxx=yyy format will be merged into the config file. If the value to be overwritten is a list, it should be like key="[a,b]" or key=a,b. It also allows nested list/tuple values, e.g. key="[(a,b),(c,d)]". Note that the quotation marks are necessary and that no white space is allowed.
+- `--device DEVICE`: Device used for inference.
+- `--legend`: Show the legend of all categories.
+- `--show`: Display the result in a graphical window.
+- `--n-components N_COMPONENTS`: The dimension of results.
+- `--perplexity PERPLEXITY`: The perplexity is related to the number of nearest neighbors that is used in other manifold learning algorithms.
+- `--early-exaggeration EARLY_EXAGGERATION`: Controls how tight natural clusters in the original space are in the embedded space and how much space will be between them.
+- `--learning-rate LEARNING_RATE`: The learning rate for t-SNE is usually in the range [10.0, 1000.0]. If the learning rate is too high, the data may look like a ball with any point approximately equidistant from its nearest neighbours. If the learning rate is too low, most points may look compressed in a dense cloud with few outliers.
+- `--n-iter N_ITER`: Maximum number of iterations for the optimization. Should be at least 250.
+- `--n-iter-without-progress N_ITER_WITHOUT_PROGRESS`: Maximum number of iterations without progress before we abort the optimization.
+- `--init INIT`: The init method.
+
+## How to visualize the t-SNE of a image classifier (such as ResNet)
+
+Here are two examples of running t-SNE visualization on ResNet-18 and ResNet-50 models, trained on the CIFAR-10 dataset:
+
+```shell
+python tools/visualization/vis_tsne.py \
+ configs/resnet/resnet18_8xb16_cifar10.py \
+ --checkpoint https://download.openmmlab.com/mmclassification/v0/resnet/resnet18_b16x8_cifar10_20210528-bd6371c8.pth
+
+python tools/visualization/vis_tsne.py \
+ configs/resnet/resnet50_8xb16_cifar10.py \
+ --checkpoint https://download.openmmlab.com/mmclassification/v0/resnet/resnet50_b16x8_cifar10_20210528-f54bfad9.pth
+```
+
+| ResNet-18 | ResNet-50 |
+| ---------------------------------------------------------------------------------------------------- | ---------------------------------------------------------------------------------------------------- |
+| | |
+
+## How to visualize the t-SNE of a self-supervised model (such as MAE)
+
+Here is an example of running t-SNE visualization on the MAE ViT-base model, trained on the ImageNet dataset. The input data comes from the ImageNet validation set. MAE and some other self-supervised pre-training algorithms do not have `test_dataloader` information in their configs. When analyzing such self-supervised algorithms, you need to add the `test_dataloader` information to the config, or you can use the `--test-cfg` argument to specify a config file.
+
+```shell
+python tools/visualization/vis_tsne.py \
+ configs/mae/mae_vit-base-p16_8xb512-amp-coslr-800e_in1k.py \
+ --checkpoint https://download.openmmlab.com/mmselfsup/1.x/mae/mae_vit-base-p16_8xb512-fp16-coslr-800e_in1k/mae_vit-base-p16_8xb512-coslr-800e-fp16_in1k_20220825-5d81fbc4.pth \
+ --test-cfg configs/_base_/datasets/imagenet_bs32.py
+```
+
+| MAE-ViT-base |
+| ------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| |
diff --git a/docs/en/useful_tools/verify_dataset.md b/docs/en/useful_tools/verify_dataset.md
new file mode 100644
index 0000000000000000000000000000000000000000..d27948f44b9980bf76bd6a582a13219667c4e683
--- /dev/null
+++ b/docs/en/useful_tools/verify_dataset.md
@@ -0,0 +1,28 @@
+# Verify Dataset
+
+In MMPretrain, we also provide a tool `tools/misc/verify_dataset.py` to check whether there are **broken pictures** in the given dataset.
+
+## Introduce the tool
+
+```shell
+python tools/misc/verify_dataset.py \
+    ${CONFIG} \
+    [--out-path ${OUT-PATH}] \
+    [--phase ${PHASE}] \
+    [--num-process ${NUM-PROCESS}] \
+ [--cfg-options ${CFG_OPTIONS}]
+```
+
+**Description of all arguments**:
+
+- `config` : The path of the model config file.
+- `--out-path` : The path to save the verification result. If not set, defaults to 'brokenfiles.log'.
+- `--phase` : The phase of the dataset to verify, accepts "train", "test" and "val". If not set, defaults to "train".
+- `--num-process` : The number of processes to use. If not set, defaults to 1.
+- `--cfg-options`: If specified, the key-value pair config will be merged into the config file, for more details please refer to [Learn about Configs](../user_guides/config.md)
+
+## Example
+
+```shell
+python tools/misc/verify_dataset.py configs/t2t_vit/t2t-vit-t-14_8xb64_in1k.py --out-path broken_imgs.log --phase val --num-process 8
+```
diff --git a/docs/en/user_guides/config.md b/docs/en/user_guides/config.md
new file mode 100644
index 0000000000000000000000000000000000000000..6077c707df0d87a25f746b7265ce6e0a4eec92d8
--- /dev/null
+++ b/docs/en/user_guides/config.md
@@ -0,0 +1,421 @@
+# Learn about Configs
+
+To manage various configurations in a deep-learning experiment, we use a kind of config file to record all of
+these configurations. This config system has a modular and inheritance design, and more details can be found in
+{external+mmengine:doc}`the tutorial in MMEngine `.
+
+Usually, we use Python files as config files. All configuration files are placed under the [`configs`](https://github.com/open-mmlab/mmpretrain/tree/main/configs) folder, and the directory structure is as follows:
+
+```text
+MMPretrain/
+ ├── configs/
+ │ ├── _base_/ # primitive configuration folder
+ │ │ ├── datasets/ # primitive datasets
+ │ │ ├── models/ # primitive models
+ │ │ ├── schedules/ # primitive schedules
+ │ │ └── default_runtime.py # primitive runtime setting
+ │ ├── beit/ # BEiT Algorithms Folder
+ │ ├── mae/ # MAE Algorithms Folder
+ │ ├── mocov2/ # MoCoV2 Algorithms Folder
+ │ ├── resnet/ # ResNet Algorithms Folder
+ │ ├── swin_transformer/ # Swin Algorithms Folder
+ │ ├── vision_transformer/ # ViT Algorithms Folder
+ │ ├── ...
+ └── ...
+```
+
+If you wish to inspect the config file, you may run `python tools/misc/print_config.py /PATH/TO/CONFIG` to see the complete config.
+
+This article mainly explains the structure of configuration files, and how to modify them based on existing configuration files. We will take the [ResNet50 config file](https://github.com/open-mmlab/mmpretrain/blob/main/configs/resnet/resnet50_8xb32_in1k.py) as an example and explain it line by line.
+
+## Config Structure
+
+There are four kinds of basic component files in the `configs/_base_` folders, namely:
+
+- [models](https://github.com/open-mmlab/mmpretrain/tree/main/configs/_base_/models)
+- [datasets](https://github.com/open-mmlab/mmpretrain/tree/main/configs/_base_/datasets)
+- [schedules](https://github.com/open-mmlab/mmpretrain/tree/main/configs/_base_/schedules)
+- [runtime](https://github.com/open-mmlab/mmpretrain/blob/main/configs/_base_/default_runtime.py)
+
+We call the config files in the `_base_` folder _primitive_ config files. You can easily build your training config file by inheriting some primitive config files.
+
+For easy understanding, we use [ResNet50 config file](https://github.com/open-mmlab/mmpretrain/blob/main/configs/resnet/resnet50_8xb32_in1k.py) as an example and comment on each line.
+
+```python
+_base_ = [ # This config file will inherit all config files in `_base_`.
+ '../_base_/models/resnet50.py', # model settings
+ '../_base_/datasets/imagenet_bs32.py', # data settings
+ '../_base_/schedules/imagenet_bs256.py', # schedule settings
+ '../_base_/default_runtime.py' # runtime settings
+]
+```
+
+We will explain the four primitive config files separately below.
+
+### Model settings
+
+This primitive config file includes a dict variable `model`, which mainly includes information such as network structure and loss function:
+
+- `type`: The type of model to build; we support several tasks.
+ - For image classification tasks, it's usually `ImageClassifier`. You can find more details in the [API documentation](mmpretrain.models.classifiers).
+ - For self-supervised learning, there are several `SelfSupervisors`, such as `MoCoV2`, `BEiT`, `MAE`, etc. You can find more details in the [API documentation](mmpretrain.models.selfsup).
+ - For image retrieval tasks, it's usually `ImageToImageRetriever`. You can find more details in the [API documentation](mmpretrain.models.retrievers).
+
+Usually, we use the **`type` field** to specify the class of the component and use other fields to pass
+the initialization arguments of the class. The {external+mmengine:doc}`registry tutorial ` describes it in detail.
+
+Here, we use the config fields of [`ImageClassifier`](mmpretrain.models.classifiers.ImageClassifier) as an example to
+describe the initialization arguments as below:
+
+- `backbone`: The settings of the backbone. The backbone is the main network to extract features of the inputs, like `ResNet`, `Swin Transformer`, `Vision Transformer` etc. All available backbones can be found in the [API documentation](mmpretrain.models.backbones).
+ - For self-supervised learning, some of the backbones are re-implemented; you can find more details in the [API documentation](mmpretrain.models.selfsup).
+- `neck`: The settings of the neck. The neck is the intermediate module to connect the backbone and the head, like `GlobalAveragePooling`. All available necks can be found in the [API documentation](mmpretrain.models.necks).
+- `head`: The settings of the task head. The head is the task-related component to do a specified task, like image classification or self-supervised training. All available heads can be found in the [API documentation](mmpretrain.models.heads).
+ - `loss`: The loss function to optimize, like `CrossEntropyLoss`, `LabelSmoothLoss`, `PixelReconstructionLoss`, etc. All available losses can be found in the [API documentation](mmpretrain.models.losses).
+- `data_preprocessor`: The component before the model forwarding to preprocess the inputs. See the [documentation](mmpretrain.models.utils.data_preprocessor) for more details.
+- `train_cfg`: The extra settings of `ImageClassifier` during training. In `ImageClassifier`, we mainly use it to specify batch augmentation settings, like `Mixup` and `CutMix`. See the [documentation](mmpretrain.models.utils.batch_augments) for more details.
+
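+For example, here is a minimal sketch (not taken from an existing config file; the `alpha` values are
+illustrative) of enabling `Mixup` and `CutMix` batch augmentations through `train_cfg`:
+
+```python
+model = dict(
+    type='ImageClassifier',
+    backbone=...,   # backbone, neck and head settings are omitted here
+    neck=...,
+    head=...,
+    # Randomly pick one of the listed batch augmentations for every training batch.
+    train_cfg=dict(augments=[
+        dict(type='Mixup', alpha=0.2),
+        dict(type='CutMix', alpha=1.0),
+    ]),
+)
+```
+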
+Following is the model primitive config of the ResNet50 config file in [`configs/_base_/models/resnet50.py`](https://github.com/open-mmlab/mmpretrain/blob/main/configs/_base_/models/resnet50.py):
+
+```python
+model = dict(
+ type='ImageClassifier', # The type of the main model (here is for image classification task).
+ backbone=dict(
+ type='ResNet', # The type of the backbone module.
+ # All fields except `type` come from the __init__ method of class `ResNet`
+ # and you can find them from https://mmpretrain.readthedocs.io/en/latest/api/generated/mmpretrain.models.backbones.ResNet.html
+ depth=50,
+ num_stages=4,
+ out_indices=(3, ),
+ frozen_stages=-1,
+ style='pytorch'),
+ neck=dict(type='GlobalAveragePooling'), # The type of the neck module.
+ head=dict(
+ type='LinearClsHead', # The type of the classification head module.
+ # All fields except `type` come from the __init__ method of class `LinearClsHead`
+ # and you can find them from https://mmpretrain.readthedocs.io/en/latest/api/generated/mmpretrain.models.heads.LinearClsHead.html
+ num_classes=1000,
+ in_channels=2048,
+ loss=dict(type='CrossEntropyLoss', loss_weight=1.0),
+ ))
+```
+
+### Data settings
+
+This primitive config file includes information to construct the dataloader and evaluator:
+
+- `data_preprocessor`: Model input preprocessing configuration, same as `model.data_preprocessor` but with lower priority.
+- `train_evaluator | val_evaluator | test_evaluator`: To build the evaluator or metrics, refer to the [tutorial](mmpretrain.evaluation).
+- `train_dataloader | val_dataloader | test_dataloader`: The settings of dataloaders
+ - `batch_size`: The batch size of each GPU.
+ - `num_workers`: The number of workers to fetch data per GPU.
+ - `sampler`: The settings of the sampler.
+ - `persistent_workers`: Whether to keep the worker processes alive after finishing one epoch.
+ - `dataset`: The settings of the dataset.
+ - `type`: The type of the dataset, we support `CustomDataset`, `ImageNet` and many other datasets, refer to [documentation](mmpretrain.datasets).
+ - `pipeline`: The data transform pipeline. You can find how to design a pipeline in [this tutorial](https://mmpretrain.readthedocs.io/en/latest/tutorials/data_pipeline.html).
+
+Following is the data primitive config of the ResNet50 config in [`configs/_base_/datasets/imagenet_bs32.py`](https://github.com/open-mmlab/mmpretrain/blob/main/configs/_base_/datasets/imagenet_bs32.py):
+
+```python
+dataset_type = 'ImageNet'
+# preprocessing configuration
+data_preprocessor = dict(
+ # Input image data channels in 'RGB' order
+ mean=[123.675, 116.28, 103.53], # Input image normalized channel mean in RGB order
+ std=[58.395, 57.12, 57.375], # Input image normalized channel std in RGB order
+ to_rgb=True, # Whether to flip the channel from BGR to RGB or RGB to BGR
+)
+
+train_pipeline = [
+ dict(type='LoadImageFromFile'), # read image
+ dict(type='RandomResizedCrop', scale=224), # Random scaling and cropping
+ dict(type='RandomFlip', prob=0.5, direction='horizontal'), # random horizontal flip
+ dict(type='PackInputs'), # prepare images and labels
+]
+
+test_pipeline = [
+ dict(type='LoadImageFromFile'), # read image
+ dict(type='ResizeEdge', scale=256, edge='short'), # Scale the short side to 256
+ dict(type='CenterCrop', crop_size=224), # center crop
+ dict(type='PackInputs'), # prepare images and labels
+]
+
+# Construct training set dataloader
+train_dataloader = dict(
+ batch_size=32, # batch size per GPU
+ num_workers=5, # Number of workers to fetch data per GPU
+ dataset=dict( # training dataset
+ type=dataset_type,
+ data_root='data/imagenet',
+ ann_file='meta/train.txt',
+ data_prefix='train',
+ pipeline=train_pipeline),
+ sampler=dict(type='DefaultSampler', shuffle=True), # default sampler
+ persistent_workers=True, # Whether to keep the worker processes alive, which can shorten the preparation time of each epoch
+)
+
+# Construct the validation set dataloader
+val_dataloader = dict(
+ batch_size=32,
+ num_workers=5,
+ dataset=dict(
+ type=dataset_type,
+ data_root='data/imagenet',
+ ann_file='meta/val.txt',
+ data_prefix='val',
+ pipeline=test_pipeline),
+ sampler=dict(type='DefaultSampler', shuffle=False),
+ persistent_workers=True,
+)
+# The settings of the evaluation metrics for validation. We use the top1 and top5 accuracy here.
+val_evaluator = dict(type='Accuracy', topk=(1, 5))
+
+test_dataloader = val_dataloader # The settings of the dataloader for the test dataset, which is the same as val_dataloader
+test_evaluator = val_evaluator # The settings of the evaluation metrics for test, which is the same as val_evaluator
+```
+
+```{note}
+The data preprocessor can be defined either in the `model.data_preprocessor` field, or using the standalone `data_preprocessor` definition here. If both of them exist, the `model.data_preprocessor` configuration is used.
+```
+
+### Schedule settings
+
+This primitive config file mainly contains training strategy settings and the settings of training, val and
+test loops:
+
+- `optim_wrapper`: The settings of the optimizer wrapper. We use the optimizer wrapper to customize the
+ optimization process.
+ - `optimizer`: Supports all `pytorch` optimizers, refers to the relevant {external+mmengine:doc}`MMEngine documentation `.
+ - `paramwise_cfg`: To set different optimization arguments according to the parameters' type or name, refer to the relevant [learning policy documentation](../advanced_guides/schedule.md).
+ - `accumulative_counts`: Optimize parameters after several backward steps instead of one backward step. You
+ can use it to simulate large batch size by small batch size.
+- `param_scheduler`: Optimizer parameters policy. You can use it to specify learning rate and momentum curves during training. See the {external+mmengine:doc}`documentation ` in MMEngine for more details.
+- `train_cfg | val_cfg | test_cfg`: The settings of the training, validation and test loops, refer to the relevant {external+mmengine:doc}`MMEngine documentation `.
+
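+For instance, here is a minimal sketch (the values are illustrative and not from an existing config) of an
+optimizer wrapper that uses gradient accumulation and parameter-wise settings, together with a warm-up plus
+cosine decay schedule:
+
+```python
+optim_wrapper = dict(
+    optimizer=dict(type='AdamW', lr=1e-3, weight_decay=0.05),
+    # Accumulate gradients of 4 iterations to simulate a 4x larger total batch size.
+    accumulative_counts=4,
+    # Disable weight decay for normalization layers and bias parameters.
+    paramwise_cfg=dict(norm_decay_mult=0.0, bias_decay_mult=0.0),
+)
+
+# Multiple schedulers can be combined: a 5-epoch linear warm-up followed by cosine decay.
+param_scheduler = [
+    dict(type='LinearLR', start_factor=0.01, by_epoch=True, begin=0, end=5),
+    dict(type='CosineAnnealingLR', T_max=95, by_epoch=True, begin=5, end=100),
+]
+```
+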
+Following is the schedule primitive config of the ResNet50 config in [`configs/_base_/schedules/imagenet_bs256.py`](https://github.com/open-mmlab/mmpretrain/blob/main/configs/_base_/schedules/imagenet_bs256.py):
+
+```python
+optim_wrapper = dict(
+ # Use SGD optimizer to optimize parameters.
+ optimizer=dict(type='SGD', lr=0.1, momentum=0.9, weight_decay=0.0001))
+
+# The tuning strategy of the learning rate.
+# The 'MultiStepLR' means to use multiple steps policy to schedule the learning rate (LR).
+param_scheduler = dict(
+ type='MultiStepLR', by_epoch=True, milestones=[30, 60, 90], gamma=0.1)
+
+# Training configuration: train for 100 epochs, and perform validation after every training epoch.
+# 'by_epoch=True' means to use `EpochBasedTrainLoop`, 'by_epoch=False' means to use `IterBasedTrainLoop`.
+train_cfg = dict(by_epoch=True, max_epochs=100, val_interval=1)
+# Use the default val loop settings.
+val_cfg = dict()
+# Use the default test loop settings.
+test_cfg = dict()
+
+# This schedule is for a total batch size of 256.
+# If you use a different total batch size, like 512, and enable automatic learning rate scaling,
+# the learning rate will be scaled up by 2 times.
+auto_scale_lr = dict(base_batch_size=256)
+```
+
+### Runtime settings
+
+This part mainly includes the checkpoint saving strategy, log configuration, training parameters, resume checkpoint path, working directory, etc.
+
+Here is the runtime primitive config file ['configs/_base_/default_runtime.py'](https://github.com/open-mmlab/mmpretrain/blob/main/configs/_base_/default_runtime.py), which is used by almost all configs:
+
+```python
+# defaults to use registries in mmpretrain
+default_scope = 'mmpretrain'
+
+# configure default hooks
+default_hooks = dict(
+ # record the time of every iteration.
+ timer=dict(type='IterTimerHook'),
+
+ # print log every 100 iterations.
+ logger=dict(type='LoggerHook', interval=100),
+
+ # enable the parameter scheduler.
+ param_scheduler=dict(type='ParamSchedulerHook'),
+
+ # save checkpoint per epoch.
+ checkpoint=dict(type='CheckpointHook', interval=1),
+
+ # set sampler seed in a distributed environment.
+ sampler_seed=dict(type='DistSamplerSeedHook'),
+
+ # validation results visualization, set True to enable it.
+ visualization=dict(type='VisualizationHook', enable=False),
+)
+
+# configure environment
+env_cfg = dict(
+ # whether to enable cudnn benchmark
+ cudnn_benchmark=False,
+
+ # set multi-process parameters
+ mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
+
+ # set distributed parameters
+ dist_cfg=dict(backend='nccl'),
+)
+
+# set visualizer
+vis_backends = [dict(type='LocalVisBackend')] # use local HDD backend
+visualizer = dict(
+ type='UniversalVisualizer', vis_backends=vis_backends, name='visualizer')
+
+# set log level
+log_level = 'INFO'
+
+# load from which checkpoint
+load_from = None
+
+# whether to resume training from the loaded checkpoint
+resume = False
+```
+
+## Inherit and Modify Config File
+
+For easy understanding, we recommend contributors inherit from existing config files. But do not abuse the
+inheritance. Usually, for all config files, we recommend a maximum inheritance level of 3.
+
+For example, if your config file is based on ResNet with some other modification, you can first inherit the
+basic ResNet structure, dataset and other training settings by specifying `_base_ ='./resnet50_8xb32_in1k.py'`
+(the path is relative to your config file), and then modify the necessary parameters in the new config file. For a
+more specific example: suppose we want to reuse almost all of the settings in `configs/resnet/resnet50_8xb32_in1k.py`,
+but use the `CutMix` train batch augment, change the number of training epochs from 100 to 300, adjust when to decay
+the learning rate, and modify the dataset path. We can create a new config file
+`configs/resnet/resnet50_8xb32-300e_in1k.py` with the content below:
+
+```python
+# create this file under 'configs/resnet/' folder
+_base_ = './resnet50_8xb32_in1k.py'
+
+# using CutMix batch augment
+model = dict(
+ train_cfg=dict(
+ augments=dict(type='CutMix', alpha=1.0)
+ )
+)
+
+# trains more epochs
+train_cfg = dict(max_epochs=300, val_interval=10) # Train for 300 epochs, evaluate every 10 epochs
+param_scheduler = dict(milestones=[150, 200, 250]) # The learning rate decay milestones have also changed
+
+# Use your own dataset directory
+train_dataloader = dict(
+ dataset=dict(data_root='mydata/imagenet/train'),
+)
+val_dataloader = dict(
+ batch_size=64, # No back-propagation during validation, larger batch size can be used
+ dataset=dict(data_root='mydata/imagenet/val'),
+)
+test_dataloader = dict(
+ batch_size=64, # No back-propagation during test, larger batch size can be used
+ dataset=dict(data_root='mydata/imagenet/val'),
+)
+```
+
+### Use intermediate variables in configs
+
+Some intermediate variables are used in the configuration file. The intermediate variables make the configuration file clearer and easier to modify.
+
+For example, `train_pipeline` / `test_pipeline` is the intermediate variable of the data pipeline. We first need to define `train_pipeline` / `test_pipeline`, and then pass them to `train_dataloader` / `test_dataloader`. If you want to modify the size of the input image during training and testing, you need to modify the intermediate variables of `train_pipeline` / `test_pipeline`.
+
+```python
+bgr_mean = [103.53, 116.28, 123.675] # mean in BGR order
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='RandomResizedCrop', scale=224, backend='pillow', interpolation='bicubic'),
+ dict(type='RandomFlip', prob=0.5, direction='horizontal'),
+ dict(
+ type='RandAugment',
+ policies='timm_increasing',
+ num_policies=2,
+ total_level=10,
+ magnitude_level=6,
+ magnitude_std=0.5,
+ hparams=dict(pad_val=[round(x) for x in bgr_mean], interpolation='bicubic')),
+ dict(type='PackInputs'),
+]
+
+test_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='ResizeEdge', scale=236, edge='short', backend='pillow', interpolation='bicubic'),
+ dict(type='CenterCrop', crop_size=224),
+ dict(type='PackInputs')
+]
+
+train_dataloader = dict(dataset=dict(pipeline=train_pipeline))
+val_dataloader = dict(dataset=dict(pipeline=test_pipeline))
+test_dataloader = dict(dataset=dict(pipeline=test_pipeline))
+```
+
+### Ignore some fields in the base configs
+
+Sometimes, you need to set `_delete_=True` to ignore some of the fields in the base configuration file. You can refer to the {external+mmengine:doc}`documentation in MMEngine ` for more instructions.
+
+The following is an example. If you want to use a cosine schedule in the above ResNet50 case, simply inheriting and modifying the config will report an error like `got an unexpected keyword argument 'milestones'`, because the `milestones` field of `param_scheduler` in the base config is kept. You need to add `_delete_=True` to ignore the original `param_scheduler` fields in the base configuration file:
+
+```python
+_base_ = '../../configs/resnet/resnet50_8xb32_in1k.py'
+
+# the learning rate scheduler
+param_scheduler = dict(type='CosineAnnealingLR', by_epoch=True, _delete_=True)
+```
+
+### Use some fields in the base configs
+
+Sometimes, you may refer to some fields in the `_base_` config, to avoid duplication of definitions. You can refer to {external+mmengine:doc}`MMEngine ` for some more instructions.
+
+The following is an example of using auto augment in the training data preprocessing pipeline, refer to [`configs/resnest/resnest50_32xb64_in1k.py`](https://github.com/open-mmlab/mmpretrain/blob/main/configs/resnest/resnest50_32xb64_in1k.py). When defining `train_pipeline`, just add the file that defines the auto augment policies to `_base_`, and then use `_base_.policies` to reference the variable in the primitive config:
+
+```python
+_base_ = [
+ '../_base_/models/resnest50.py', '../_base_/datasets/imagenet_bs64.py',
+ '../_base_/default_runtime.py', './_randaug_policies.py',
+]
+
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(
+ type='RandAugment',
+ policies=_base_.policies, # This uses the `policies` parameter in the primitive config.
+ num_policies=2,
+ magnitude_level=12),
+ dict(type='EfficientNetRandomCrop', scale=224, backend='pillow'),
+ dict(type='RandomFlip', prob=0.5, direction='horizontal'),
+ dict(type='ColorJitter', brightness=0.4, contrast=0.4, saturation=0.4),
+ dict(
+ type='Lighting',
+ eigval=EIGVAL,
+ eigvec=EIGVEC,
+ alphastd=0.1,
+ to_rgb=False),
+ dict(type='PackInputs'),
+]
+
+train_dataloader = dict(dataset=dict(pipeline=train_pipeline))
+```
+
+## Modify config in command
+
+When you use the scripts `tools/train.py` or `tools/test.py` to submit tasks, or use some other tools, you can directly modify the content of the configuration file used by specifying the `--cfg-options` argument. A combined sketch is given after the list below.
+
+- Update config keys of dict chains.
+
+ The config options can be specified following the order of the dict keys in the original config.
+ For example, `--cfg-options model.backbone.norm_eval=False` changes all the BN modules in the model backbone to `train` mode.
+
+- Update keys inside a list of configs.
+
+ Some config dicts are composed as a list in your config. For example, the training pipeline `train_dataloader.dataset.pipeline` is normally a list,
+ e.g. `[dict(type='LoadImageFromFile'), dict(type='RandomFlip', prob=0.5, direction='horizontal'), ...]`. If you want to change `prob=0.5` to `prob=0.0` in the pipeline,
+ you may specify `--cfg-options train_dataloader.dataset.pipeline.1.prob=0.0`.
+
+- Update values of list/tuples.
+
+ If the value to be updated is a list or a tuple, quotation marks are needed. For example, the config file normally sets `val_evaluator = dict(type='Accuracy', topk=(1, 5))`. If you want to change the field `topk`, you may specify `--cfg-options val_evaluator.topk="(1,3)"`. Note that the quotation mark " is necessary to support list/tuple data types and that **NO** white space is allowed inside the quotation marks in the specified value.
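+
+Putting the three cases above together, here is a small sketch of what such overrides do, expressed with
+MMEngine's `Config` API (the config path is just an example, and this assumes the behaviour of
+`Config.merge_from_dict`):
+
+```python
+from mmengine.config import Config
+
+cfg = Config.fromfile('configs/resnet/resnet50_8xb32_in1k.py')
+# Equivalent to passing the same key-value pairs through `--cfg-options`.
+cfg.merge_from_dict(
+    {
+        'model.backbone.norm_eval': False,                 # key of a dict chain
+        'train_dataloader.dataset.pipeline.1.prob': 0.0,   # key inside a list of configs
+        'val_evaluator.topk': (1, 3),                      # tuple value
+    },
+    allow_list_keys=True,  # let numeric indices address list elements
+)
+```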
diff --git a/docs/en/user_guides/dataset_prepare.md b/docs/en/user_guides/dataset_prepare.md
new file mode 100644
index 0000000000000000000000000000000000000000..17ec229b86693ce50f3c989f45a556da5b696260
--- /dev/null
+++ b/docs/en/user_guides/dataset_prepare.md
@@ -0,0 +1,364 @@
+# Prepare Dataset
+
+## CustomDataset
+
+[`CustomDataset`](mmpretrain.datasets.CustomDataset) is a general dataset class for you to use your own datasets. To use `CustomDataset`, you need to organize your dataset files according to the following two formats:
+
+### Subfolder Format
+
+In this format, you only need to re-organize your dataset folder and place all samples in one folder without
+creating any annotation files.
+
+For supervised tasks (with `with_label=True`), we use the names of the sub-folders as the category names. As
+shown in the example below, `class_x` and `class_y` will be recognized as the category names.
+
+```text
+data_prefix/
+├── class_x
+│ ├── xxx.png
+│ ├── xxy.png
+│ └── ...
+│ └── xxz.png
+└── class_y
+ ├── 123.png
+ ├── nsdf3.png
+ ├── ...
+ └── asd932_.png
+```
+
+For unsupervised tasks (with `with_label=False`), we directly load all sample files under the specified folder:
+
+```text
+data_prefix/
+├── folder_1
+│ ├── xxx.png
+│ ├── xxy.png
+│ └── ...
+├── 123.png
+├── nsdf3.png
+└── ...
+```
+
+Assume you want to use it as the training dataset; below are the corresponding configurations in your config file.
+
+```python
+train_dataloader = dict(
+ ...
+ # Training dataset configurations
+ dataset=dict(
+ type='CustomDataset',
+ data_prefix='path/to/data_prefix',
+ with_label=True, # or False for unsupervised tasks
+ pipeline=...
+ )
+)
+```
+
+```{note}
+If you want to use this format, do not specify `ann_file`, or specify `ann_file=''`.
+
+Also note that the subfolder format requires scanning the folders, which may slow down initialization,
+especially for large datasets or slow file IO.
+```
+
+### Text Annotation File Format
+
+In this format, we use a text annotation file to store image file paths and the corresponding category
+indices.
+
+For supervised tasks (with `with_label=True`), the annotation file should include the file path and the
+category index of one sample per line, separated by a space, as below:
+
+All these file paths can be absolute paths, or paths relative to the `data_prefix`.
+
+```text
+folder_1/xxx.png 0
+folder_1/xxy.png 1
+123.png 4
+nsdf3.png 3
+...
+```
+
+```{note}
+The index numbers of categories start from 0. And the value of ground-truth labels should fall in range `[0, num_classes - 1]`.
+
+In addition, please use the `classes` field in the dataset settings to specify the name of every category.
+```
+
+For unsupervised tasks (with `with_label=False`), the annotation file only needs to include the file path of
+one sample per line, as below:
+
+```text
+folder_1/xxx.png
+folder_1/xxy.png
+123.png
+nsdf3.png
+...
+```
+
+Assume the entire dataset folder is as below:
+
+```text
+data_root
+├── meta
+│ ├── test.txt # The annotation file for the test dataset
+│ ├── train.txt # The annotation file for the training dataset
+│ └── val.txt # The annotation file for the validation dataset.
+├── train
+│ ├── 123.png
+│ ├── folder_1
+│ │ ├── xxx.png
+│ │ └── xxy.png
+│ └── nsdf3.png
+├── test
+└── val
+```
+
+Here is an example of the dataset settings in a config file:
+
+```python
+# Training dataloader configurations
+train_dataloader = dict(
+ dataset=dict(
+ type='CustomDataset',
+ data_root='path/to/data_root', # The common prefix of both `ann_file` and `data_prefix`.
+ ann_file='meta/train.txt', # The path of annotation file relative to the data_root.
+ data_prefix='train', # The prefix of file paths in the `ann_file`, relative to the data_root.
+ with_label=True, # or False for unsupervised tasks
+ classes=['A', 'B', 'C', 'D', ...], # The name of every category.
+ pipeline=..., # The transformations to process the dataset samples.
+ )
+ ...
+)
+```
+
+```{note}
+For a complete example about how to use the `CustomDataset`, please see [How to Pretrain with Custom Dataset](../notes/pretrain_custom_dataset.md)
+```
+
+## ImageNet
+
+ImageNet has multiple versions, but the most commonly used one is [ILSVRC 2012](http://www.image-net.org/challenges/LSVRC/2012/). It can be accessed with the following steps.
+
+`````{tabs}
+
+````{group-tab} Download by MIM
+
+MIM supports downloading from [OpenXlab](https://openxlab.org.cn/datasets) and preprocessing ImageNet dataset with one command line.
+
+_You need to register an account at the [OpenXlab official website](https://openxlab.org.cn/datasets) and log in via the CLI._
+
+```Bash
+# install OpenXlab CLI tools
+pip install -U openxlab
+# log in OpenXLab
+openxlab login
+# download and preprocess by MIM, better to execute in $MMPreTrain directory.
+mim download mmpretrain --dataset imagenet1k
+```
+
+````
+
+````{group-tab} Download from Official Source
+
+1. Register an account and log in to the [download page](http://www.image-net.org/download-images).
+2. Find download links for ILSVRC2012 and download the following two files
+ - ILSVRC2012_img_train.tar (~138GB)
+ - ILSVRC2012_img_val.tar (~6.3GB)
+3. Untar the downloaded files
+
+````
+
+`````
+
+### The Directory Structure of the ImageNet dataset
+
+We support two ways of organizing the ImageNet dataset: Subfolder Format and Text Annotation File Format.
+
+#### Subfolder Format
+
+We have provided a sample, which you can download and extract from this [link](https://download.openmmlab.com/mmpretrain/datasets/imagenet_1k.zip). The directory structure of the dataset should be as below:
+
+```text
+data/imagenet/
+├── train/
+│ ├── n01440764
+│ │ ├── n01440764_10026.JPEG
+│ │ ├── n01440764_10027.JPEG
+│ │ ├── n01440764_10029.JPEG
+│ │ ├── n01440764_10040.JPEG
+│ │ ├── n01440764_10042.JPEG
+│ │ ├── n01440764_10043.JPEG
+│ │ └── n01440764_10048.JPEG
+│ ├── ...
+├── val/
+│ ├── n01440764
+│ │ ├── ILSVRC2012_val_00000293.JPEG
+│ │ ├── ILSVRC2012_val_00002138.JPEG
+│ │ ├── ILSVRC2012_val_00003014.JPEG
+│ │ └── ...
+│ ├── ...
+```
+
+#### Text Annotation File Format
+
+You can download and untar the meta data from this [link](https://download.openmmlab.com/mmclassification/datasets/imagenet/meta/caffe_ilsvrc12.tar.gz). And re-organize the dataset as below:
+
+```text
+data/imagenet/
+├── meta/
+│ ├── train.txt
+│ ├── test.txt
+│ └── val.txt
+├── train/
+│ ├── n01440764
+│ │ ├── n01440764_10026.JPEG
+│ │ ├── n01440764_10027.JPEG
+│ │ ├── n01440764_10029.JPEG
+│ │ ├── n01440764_10040.JPEG
+│ │ ├── n01440764_10042.JPEG
+│ │ ├── n01440764_10043.JPEG
+│ │ └── n01440764_10048.JPEG
+│ ├── ...
+├── val/
+│ ├── ILSVRC2012_val_00000001.JPEG
+│ ├── ILSVRC2012_val_00000002.JPEG
+│ ├── ILSVRC2012_val_00000003.JPEG
+│ ├── ILSVRC2012_val_00000004.JPEG
+│ ├── ...
+```
+
+### Configuration
+
+Once your dataset is organized in the way described above, you can use the [`ImageNet`](mmpretrain.datasets.ImageNet) dataset with the below configurations:
+
+```python
+train_dataloader = dict(
+ ...
+ # Training dataset configurations
+ dataset=dict(
+ type='ImageNet',
+ data_root='data/imagenet',
+ split='train',
+ pipeline=...,
+ )
+)
+
+val_dataloader = dict(
+ ...
+ # Validation dataset configurations
+ dataset=dict(
+ type='ImageNet',
+ data_root='data/imagenet',
+ split='val',
+ pipeline=...,
+ )
+)
+
+test_dataloader = val_dataloader
+```
+
+## Supported Image Classification Datasets
+
+| Datasets | split | HomePage |
+| ---------------------------------------------------------------------------------- | :---------------------------------- | ----------------------------------------------------------------------------------- |
+| [`Caltech101`](mmpretrain.datasets.Caltech101)(data_root[, split, pipeline, ...]) | ["train", "test"] | [Caltech 101](https://data.caltech.edu/records/mzrjq-6wc02) Dataset. |
+| [`CIFAR10`](mmpretrain.datasets.CIFAR10)(data_root[, split, pipeline, ...]) | ["train", "test"] | [CIFAR10](https://www.cs.toronto.edu/~kriz/cifar.html) Dataset. |
+| [`CIFAR100`](mmpretrain.datasets.CIFAR100)(data_root[, split, pipeline, ...]) | ["train", "test"] | [CIFAR100](https://www.cs.toronto.edu/~kriz/cifar.html) Dataset. |
+| [`CUB`](mmpretrain.datasets.CUB)(data_root[, split, pipeline, ...]) | ["train", "test"] | [CUB-200-2011](http://www.vision.caltech.edu/datasets/cub_200_2011/) Dataset. |
+| [`DTD`](mmpretrain.datasets.DTD)(data_root[, split, pipeline, ...]) | ["train", "val", "trainval", "test"] | [Describable Texture Dataset (DTD)](https://www.robots.ox.ac.uk/~vgg/data/dtd/) Dataset. |
+| [`FashionMNIST`](mmpretrain.datasets.FashionMNIST) (data_root[, split, pipeline, ...]) | ["train", "test"] | [Fashion-MNIST](https://github.com/zalandoresearch/fashion-mnist) Dataset. |
+| [`FGVCAircraft`](mmpretrain.datasets.FGVCAircraft)(data_root[, split, pipeline, ...]) | ["train", "val", "trainval", "test"] | [FGVC Aircraft](https://www.robots.ox.ac.uk/~vgg/data/fgvc-aircraft/) Dataset. |
+| [`Flowers102`](mmpretrain.datasets.Flowers102)(data_root[, split, pipeline, ...]) | ["train", "val", "trainval", "test"] | [Oxford 102 Flower](https://www.robots.ox.ac.uk/~vgg/data/flowers/102/) Dataset. |
+| [`Food101`](mmpretrain.datasets.Food101)(data_root[, split, pipeline, ...]) | ["train", "test"] | [Food101](https://data.vision.ee.ethz.ch/cvl/datasets_extra/food-101/) Dataset. |
+| [`MNIST`](mmpretrain.datasets.MNIST) (data_root[, split, pipeline, ...]) | ["train", "test"] | [MNIST](http://yann.lecun.com/exdb/mnist/) Dataset. |
+| [`OxfordIIITPet`](mmpretrain.datasets.OxfordIIITPet)(data_root[, split, pipeline, ...]) | ["trainval", "test"] | [Oxford-IIIT Pets](https://www.robots.ox.ac.uk/~vgg/data/pets/) Dataset. |
+| [`Places205`](mmpretrain.datasets.Places205)(data_root[, pipeline, ...]) | - | [Places205](http://places.csail.mit.edu/downloadData.html) Dataset. |
+| [`StanfordCars`](mmpretrain.datasets.StanfordCars)(data_root[, split, pipeline, ...]) | ["train", "test"] | [Stanford Cars](https://ai.stanford.edu/~jkrause/cars/car_dataset.html) Dataset. |
+| [`SUN397`](mmpretrain.datasets.SUN397)(data_root[, split, pipeline, ...]) | ["train", "test"] | [SUN397](https://vision.princeton.edu/projects/2010/SUN/) Dataset. |
+| [`VOC`](mmpretrain.datasets.VOC)(data_root[, image_set_path, pipeline, ...]) | ["train", "val", "trainval", "test"] | [Pascal VOC](http://host.robots.ox.ac.uk/pascal/VOC/) Dataset. |
+
+Some dataset homepage links may be unavailable, and you can download datasets through [OpenXLab](https://openxlab.org.cn/datasets), such as [Stanford Cars](https://openxlab.org.cn/datasets/OpenDataLab/Stanford_Cars).
+
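+As a quick sketch of how the `split` argument from the table is used (the paths and pipeline are
+illustrative), the config below builds a CIFAR10 training dataloader:
+
+```python
+train_dataloader = dict(
+    batch_size=16,       # illustrative value
+    num_workers=2,       # illustrative value
+    sampler=dict(type='DefaultSampler', shuffle=True),
+    dataset=dict(
+        type='CIFAR10',
+        data_root='data/cifar10',            # root directory of the CIFAR10 data
+        split='train',                       # one of the splits listed in the table
+        pipeline=[dict(type='PackInputs')],  # minimal pipeline, just for illustration
+    ),
+)
+```
+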
+## Supported Multi-modality Datasets
+
+| Datasets | split | HomePage |
+| --------------------------------------------------------------------------------------------- | :----------------------- | ----------------------------------------------------------------------------------- |
+| [`RefCOCO`](mmpretrain.datasets.RefCOCO)(data_root, ann_file, data_prefix, split_file[, split, ...]) | ["train", "val", "test"] | [RefCOCO](https://bvisionweb1.cs.unc.edu/licheng/referit/data/refcoco.zip) Dataset. |
+
+Some dataset homepage links may be unavailable, and you can download datasets through [OpenDataLab](https://opendatalab.com/), such as [RefCOCO](https://opendatalab.com/RefCOCO/download).
+
+## OpenMMLab 2.0 Standard Dataset
+
+In order to facilitate the training of multi-task algorithm models, we unify the dataset interfaces of different tasks. OpenMMLab has formulated the **OpenMMLab 2.0 Dataset Format Specification**. When starting a training task, users can choose to convert their dataset annotations into the specified format, and use the algorithm libraries of OpenMMLab to perform training and testing based on the data annotation files.
+
+The OpenMMLab 2.0 Dataset Format Specification stipulates that the annotation file must be in `json`, `yaml`/`yml` or `pickle`/`pkl` format. The dictionary stored in the annotation file must contain the `metainfo` and `data_list` fields. The value of `metainfo` is a dictionary containing the meta information of the dataset; the value of `data_list` is a list, and each element in the list is a dictionary that defines one raw data item, which contains one or several training/testing samples.
+
+The following is an example of a JSON annotation file (in this example each raw data contains only one train/test sample):
+
+```
+{
+ 'metainfo':
+ {
+ 'classes': ('cat', 'dog'), # the category index of 'cat' is 0 and 'dog' is 1.
+ ...
+ },
+ 'data_list':
+ [
+ {
+ 'img_path': "xxx/xxx_0.jpg",
+ 'gt_label': 0,
+ ...
+ },
+ {
+ 'img_path': "xxx/xxx_1.jpg",
+ 'gt_label': 1,
+ ...
+ },
+ ...
+ ]
+}
+```
+
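+Such an annotation file can be produced by a small conversion script. Here is a minimal sketch (the file
+names, labels and output path are illustrative):
+
+```python
+import json
+
+ann = {
+    'metainfo': {'classes': ('cat', 'dog')},  # category 0 is 'cat', category 1 is 'dog'
+    'data_list': [
+        {'img_path': 'xxx/xxx_0.jpg', 'gt_label': 0},
+        {'img_path': 'xxx/xxx_1.jpg', 'gt_label': 1},
+    ],
+}
+
+with open('data/annotations/train.json', 'w') as f:
+    json.dump(ann, f)
+```
+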
+Assume you want to use the training dataset and the dataset is stored as the below structure:
+
+```text
+data
+├── annotations
+│ ├── train.json
+├── train
+│ ├── xxx/xxx_0.jpg
+│ ├── xxx/xxx_1.jpg
+│ ├── ...
+```
+
+You can then build the training dataloader from the following configuration:
+
+```python
+train_dataloader = dict(
+ ...
+ dataset=dict(
+ type='BaseDataset',
+ data_root='data',
+ ann_file='annotations/train.json',
+ data_prefix='train/',
+ pipeline=...,
+ )
+)
+```
+
+## Other Datasets
+
+To find more datasets supported by MMPretrain, and get more configurations of the above datasets, please see the [dataset documentation](mmpretrain.datasets).
+
+To implement your own dataset class for some special formats, please see the [Adding New Dataset](../advanced_guides/datasets.md).
+
+## Dataset Wrappers
+
+The following dataset wrappers are supported in MMEngine; you can refer to the {external+mmengine:doc}`MMEngine tutorial ` to learn how to use them.
+
+- {external:py:class}`~mmengine.dataset.ConcatDataset`
+- {external:py:class}`~mmengine.dataset.RepeatDataset`
+- {external:py:class}`~mmengine.dataset.ClassBalancedDataset`
+
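+For example, here is a sketch (the values are illustrative) of wrapping a dataset with `RepeatDataset`
+inside a dataloader config:
+
+```python
+train_dataloader = dict(
+    batch_size=32,
+    num_workers=4,
+    sampler=dict(type='DefaultSampler', shuffle=True),
+    dataset=dict(
+        type='RepeatDataset',
+        times=3,  # repeat the wrapped dataset 3 times in every epoch
+        dataset=dict(
+            type='ImageNet',
+            data_root='data/imagenet',
+            split='train',
+            pipeline=[dict(type='LoadImageFromFile'), dict(type='PackInputs')],
+        ),
+    ),
+)
+```
+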
+MMPretrain also supports [`KFoldDataset`](mmpretrain.datasets.KFoldDataset); please use it with `tools/kfold-cross-valid.py`.
diff --git a/docs/en/user_guides/downstream.md b/docs/en/user_guides/downstream.md
new file mode 100644
index 0000000000000000000000000000000000000000..9abb077ae9b98b25054441a618d14b34406c2d2c
--- /dev/null
+++ b/docs/en/user_guides/downstream.md
@@ -0,0 +1,128 @@
+# Downstream tasks
+
+## Detection
+
+For detection tasks, please use MMDetection. First, make sure you have installed [MIM](https://github.com/open-mmlab/mim), which is also a project of OpenMMLab.
+
+```shell
+pip install openmim
+mim install 'mmdet>=3.0.0rc0'
+```
+
+Besides, please refer to MMDet for [installation](https://mmdetection.readthedocs.io/en/dev-3.x/get_started.html) and [data preparation](https://mmdetection.readthedocs.io/en/dev-3.x/user_guides/dataset_prepare.html).
+
+### Train
+
+After installation, you can run MMDetection with a simple command.
+
+```shell
+# distributed version
+bash tools/benchmarks/mmdetection/mim_dist_train_c4.sh ${CONFIG} ${PRETRAIN} ${GPUS}
+bash tools/benchmarks/mmdetection/mim_dist_train_fpn.sh ${CONFIG} ${PRETRAIN} ${GPUS}
+
+# slurm version
+bash tools/benchmarks/mmdetection/mim_slurm_train_c4.sh ${PARTITION} ${CONFIG} ${PRETRAIN}
+bash tools/benchmarks/mmdetection/mim_slurm_train_fpn.sh ${PARTITION} ${CONFIG} ${PRETRAIN}
+```
+
+- `${CONFIG}`: Use the config file path in MMDetection directly. For some algorithms, we also have some
+ modified config files which can be found in the `benchmarks` folder under the corresponding algorithm
+ folder. You can also write your config file from scratch.
+- `${PRETRAIN}`: the pre-trained model file.
+- `${GPUS}`: The number of GPUs that you want to use to train. We adopt 8 GPUs for detection tasks by default.
+
+Example:
+
+```shell
+bash ./tools/benchmarks/mmdetection/mim_dist_train_c4.sh \
+ configs/byol/benchmarks/mask-rcnn_r50-c4_ms-1x_coco.py \
+ https://download.openmmlab.com/mmselfsup/1.x/byol/byol_resnet50_16xb256-coslr-200e_in1k/byol_resnet50_16xb256-coslr-200e_in1k_20220825-de817331.pth 8
+```
+
+### Test
+
+After training, you can also run the command below to test your model.
+
+```shell
+# distributed version
+bash tools/benchmarks/mmdetection/mim_dist_test.sh ${CONFIG} ${CHECKPOINT} ${GPUS}
+
+# slurm version
+bash tools/benchmarks/mmdetection/mim_slurm_test.sh ${PARTITION} ${CONFIG} ${CHECKPOINT}
+```
+
+- `${CONFIG}`: Use the config file name in MMDetection directly. For some algorithms, we also have some
+ modified config files which can be found in the `benchmarks` folder under the corresponding algorithm
+ folder. You can also write your config file from scratch.
+- `${CHECKPOINT}`: The fine-tuned detection model that you want to test.
+- `${GPUS}`: The number of GPUs that you want to use to test. We adopt 8 GPUs for detection tasks by default.
+
+Example:
+
+```shell
+bash ./tools/benchmarks/mmdetection/mim_dist_test.sh \
+configs/byol/benchmarks/mask-rcnn_r50_fpn_ms-1x_coco.py \
+https://download.openmmlab.com/mmselfsup/1.x/byol/byol_resnet50_16xb256-coslr-200e_in1k/byol_resnet50_16xb256-coslr-200e_in1k_20220825-de817331.pth 8
+```
+
+## Segmentation
+
+For the semantic segmentation task, we use MMSegmentation. First, make sure you have installed [MIM](https://github.com/open-mmlab/mim), which is also a project of OpenMMLab.
+
+```shell
+pip install openmim
+mim install 'mmsegmentation>=1.0.0rc0'
+```
+
+Besides, please refer to MMSegmentation for [installation](https://mmsegmentation.readthedocs.io/en/dev-1.x/get_started.html) and [data preparation](https://mmsegmentation.readthedocs.io/en/dev-1.x/user_guides/2_dataset_prepare.html).
+
+### Train
+
+After installation, you can run MMSegmentation with a simple command.
+
+```shell
+# distributed version
+bash tools/benchmarks/mmsegmentation/mim_dist_train.sh ${CONFIG} ${PRETRAIN} ${GPUS}
+
+# slurm version
+bash tools/benchmarks/mmsegmentation/mim_slurm_train.sh ${PARTITION} ${CONFIG} ${PRETRAIN}
+```
+
+- `${CONFIG}`: Use the config file path in MMSegmentation directly. For some algorithms, we also have some
+ modified config files which can be found in the `benchmarks` folder under the corresponding algorithm
+ folder. You can also write your config file from scratch.
+- `${PRETRAIN}`: the pre-trained model file.
+- `${GPUS}`: The number of GPUs that you want to use to train. We adopt 4 GPUs for segmentation tasks by default.
+
+Example:
+
+```shell
+bash ./tools/benchmarks/mmsegmentation/mim_dist_train.sh \
+configs/benchmarks/mmsegmentation/voc12aug/fcn_r50-d8_4xb4-20k_voc12aug-512x512.py \
+https://download.openmmlab.com/mmselfsup/1.x/byol/byol_resnet50_16xb256-coslr-200e_in1k/byol_resnet50_16xb256-coslr-200e_in1k_20220825-de817331.pth 4
+```
+
+### Test
+
+After training, you can also run the command below to test your model.
+
+```shell
+# distributed version
+bash tools/benchmarks/mmsegmentation/mim_dist_test.sh ${CONFIG} ${CHECKPOINT} ${GPUS}
+
+# slurm version
+bash tools/benchmarks/mmsegmentation/mim_slurm_test.sh ${PARTITION} ${CONFIG} ${CHECKPOINT}
+```
+
+- `${CONFIG}`: Use the config file name in MMSegmentation directly. For some algorithms, we also have some
+ modified config files which can be found in the `benchmarks` folder under the corresponding algorithm
+ folder. You can also write your config file from scratch.
+- `${CHECKPOINT}`: The fine-tuned segmentation model that you want to test.
+- `${GPUS}`: The number of GPUs that you want to use to test. We adopt 4 GPUs for segmentation tasks by default.
+
+Example:
+
+```shell
+bash ./tools/benchmarks/mmsegmentation/mim_dist_test.sh fcn_r50-d8_4xb4-20k_voc12aug-512x512.py \
+https://download.openmmlab.com/mmselfsup/1.x/byol/byol_resnet50_16xb256-coslr-200e_in1k/byol_resnet50_16xb256-coslr-200e_in1k_20220825-de817331.pth 4
+```
diff --git a/docs/en/user_guides/inference.md b/docs/en/user_guides/inference.md
new file mode 100644
index 0000000000000000000000000000000000000000..8d6cbefb67d9e7790627d566fca1a89cfd9bcfe2
--- /dev/null
+++ b/docs/en/user_guides/inference.md
@@ -0,0 +1,179 @@
+# Inference with existing models
+
+This tutorial will show how to use the following APIs:
+
+- [**`list_models`**](mmpretrain.apis.list_models): List available model names in MMPreTrain.
+- [**`get_model`**](mmpretrain.apis.get_model): Get a model from model name or model config.
+- [**`inference_model`**](mmpretrain.apis.inference_model): Run inference with a model using the corresponding
+ inferencer. It's a shortcut for a quick start; for advanced usage, please use the inferencers below
+ directly.
+- Inferencers:
+ 1. [**`ImageClassificationInferencer`**](mmpretrain.apis.ImageClassificationInferencer):
+ Perform image classification on the given image.
+ 2. [**`ImageRetrievalInferencer`**](mmpretrain.apis.ImageRetrievalInferencer):
+ Perform image-to-image retrieval from the given image on a given image set.
+ 3. [**`ImageCaptionInferencer`**](mmpretrain.apis.ImageCaptionInferencer):
+ Generate a caption on the given image.
+ 4. [**`VisualQuestionAnsweringInferencer`**](mmpretrain.apis.VisualQuestionAnsweringInferencer):
+ Answer a question according to the given image.
+ 5. [**`VisualGroundingInferencer`**](mmpretrain.apis.VisualGroundingInferencer):
+ Locate an object from the description on the given image.
+ 6. [**`TextToImageRetrievalInferencer`**](mmpretrain.apis.TextToImageRetrievalInferencer):
+ Perform text-to-image retrieval from the given description on a given image set.
+ 7. [**`ImageToTextRetrievalInferencer`**](mmpretrain.apis.ImageToTextRetrievalInferencer):
+ Perform image-to-text retrieval from the given image on a series of text.
+ 8. [**`NLVRInferencer`**](mmpretrain.apis.NLVRInferencer):
+ Perform Natural Language for Visual Reasoning on a given image-pair and text.
+ 9. [**`FeatureExtractor`**](mmpretrain.apis.FeatureExtractor):
+ Extract features from the image files by a vision backbone.
+
+## List available models
+
+List all the models in MMPreTrain.
+
+```python
+>>> from mmpretrain import list_models
+>>> list_models()
+['barlowtwins_resnet50_8xb256-coslr-300e_in1k',
+ 'beit-base-p16_beit-in21k-pre_3rdparty_in1k',
+ ...]
+```
+
+`list_models` supports Unix filename pattern matching; you can use `*` to match any characters.
+
+```python
+>>> from mmpretrain import list_models
+>>> list_models("*convnext-b*21k")
+['convnext-base_3rdparty_in21k',
+ 'convnext-base_in21k-pre-3rdparty_in1k-384px',
+ 'convnext-base_in21k-pre_3rdparty_in1k']
+```
+
+You can use the `list_models` method of inferencers to get the available models of the corresponding tasks.
+
+```python
+>>> from mmpretrain import ImageCaptionInferencer
+>>> ImageCaptionInferencer.list_models()
+['blip-base_3rdparty_caption',
+ 'blip2-opt2.7b_3rdparty-zeroshot_caption',
+ 'flamingo_3rdparty-zeroshot_caption',
+ 'ofa-base_3rdparty-finetuned_caption']
+```
+
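+Once you have picked a model name, the task-specific inferencers are called in the same way as the
+classification example later on this page. Here is a minimal sketch (the exact keys of the returned dict
+depend on the task):
+
+```python
+>>> from mmpretrain import ImageCaptionInferencer
+>>> inferencer = ImageCaptionInferencer('blip-base_3rdparty_caption')
+>>> result = inferencer('https://github.com/open-mmlab/mmpretrain/raw/main/demo/demo.JPEG')[0]
+>>> print(result)
+```
+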
+## Get a model
+
+You can use `get_model` to get a model.
+
+```python
+>>> from mmpretrain import get_model
+
+# Get model without loading pre-trained weight.
+>>> model = get_model("convnext-base_in21k-pre_3rdparty_in1k")
+
+# Get model and load the default checkpoint.
+>>> model = get_model("convnext-base_in21k-pre_3rdparty_in1k", pretrained=True)
+
+# Get model and load the specified checkpoint.
+>>> model = get_model("convnext-base_in21k-pre_3rdparty_in1k", pretrained="your_local_checkpoint_path")
+
+# Get model with extra initialization arguments, for example, modify the num_classes in head.
+>>> model = get_model("convnext-base_in21k-pre_3rdparty_in1k", head=dict(num_classes=10))
+
+# Another example, remove the neck and head, and output from stage 1, 2, 3 in backbone
+>>> model_headless = get_model("resnet18_8xb32_in1k", head=None, neck=None, backbone=dict(out_indices=(1, 2, 3)))
+```
+
+The obtained model is a usual PyTorch module.
+
+```python
+>>> import torch
+>>> from mmpretrain import get_model
+>>> model = get_model('convnext-base_in21k-pre_3rdparty_in1k', pretrained=True)
+>>> x = torch.rand((1, 3, 224, 224))
+>>> y = model(x)
+>>> print(type(y), y.shape)
+<class 'torch.Tensor'> torch.Size([1, 1000])
+```
+
+## Inference on given images
+
+Here is an example of performing inference on an [image](https://github.com/open-mmlab/mmpretrain/raw/main/demo/demo.JPEG) with the pre-trained ResNet-50 classification model.
+
+```python
+>>> from mmpretrain import inference_model
+>>> image = 'https://github.com/open-mmlab/mmpretrain/raw/main/demo/demo.JPEG'
+>>> # If you have no graphical interface, please set `show=False`
+>>> result = inference_model('resnet50_8xb32_in1k', image, show=True)
+>>> print(result['pred_class'])
+sea snake
+```
+
+The `inference_model` API is only for demonstration and cannot keep the model instance or run inference on multiple
+samples. You can use the inferencers for multiple calls.
+
+```python
+>>> from mmpretrain import ImageClassificationInferencer
+>>> image = 'https://github.com/open-mmlab/mmpretrain/raw/main/demo/demo.JPEG'
+>>> inferencer = ImageClassificationInferencer('resnet50_8xb32_in1k')
+>>> # Note that the inferencer output is a list of results even if the input is a single sample.
+>>> result = inferencer('https://github.com/open-mmlab/mmpretrain/raw/main/demo/demo.JPEG')[0]
+>>> print(result['pred_class'])
+sea snake
+>>>
+>>> # You can also use it for multiple images.
+>>> image_list = ['demo/demo.JPEG', 'demo/bird.JPEG'] * 16
+>>> results = inferencer(image_list, batch_size=8)
+>>> print(len(results))
+32
+>>> print(results[1]['pred_class'])
+house finch, linnet, Carpodacus mexicanus
+```
+
+Usually, the result for every sample is a dictionary. For example, the image classification result is a dictionary containing `pred_label`, `pred_score`, `pred_scores` and `pred_class` as follows:
+
+```python
+{
+ "pred_label": 65,
+ "pred_score": 0.6649366617202759,
+ "pred_class":"sea snake",
+ "pred_scores": array([..., 0.6649366617202759, ...], dtype=float32)
+}
+```
+
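+If you need more than the single best prediction, here is a small sketch (not part of the API) of ranking
+the returned `pred_scores` array from the result dict above to get the top-5 class indices:
+
+```python
+>>> import numpy as np
+>>> top5 = np.argsort(result['pred_scores'])[::-1][:5]
+>>> print(top5)
+>>> print(result['pred_scores'][top5])
+```
+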
+You can configure the inferencer by arguments, for example, use your own config file and checkpoint to
+run inference on images with CUDA.
+
+```python
+>>> from mmpretrain import ImageClassificationInferencer
+>>> image = 'https://github.com/open-mmlab/mmpretrain/raw/main/demo/demo.JPEG'
+>>> config = 'configs/resnet/resnet50_8xb32_in1k.py'
+>>> checkpoint = 'https://download.openmmlab.com/mmclassification/v0/resnet/resnet50_8xb32_in1k_20210831-ea4938fc.pth'
+>>> inferencer = ImageClassificationInferencer(model=config, pretrained=checkpoint, device='cuda')
+>>> result = inferencer(image)[0]
+>>> print(result['pred_class'])
+sea snake
+```
+
+## Inference by a Gradio demo
+
+We also provide a gradio demo for all supported tasks and you can find it in [projects/gradio_demo/launch.py](https://github.com/open-mmlab/mmpretrain/blob/main/projects/gradio_demo/launch.py).
+
+Please install `gradio` with `pip install -U gradio` first.
+
+Here is the interface preview:
+
+*(Gradio demo interface preview)*
+
+## Extract Features From Image
+
+Compared with `model.extract_feat`, `FeatureExtractor` is used to extract features directly from image files, instead of from a batch of tensors.
+In short, the input of `model.extract_feat` is a `torch.Tensor`, while the input of `FeatureExtractor` is image files.
+
+```python
+>>> from mmpretrain import FeatureExtractor, get_model
+>>> model = get_model('resnet50_8xb32_in1k', backbone=dict(out_indices=(0, 1, 2, 3)))
+>>> extractor = FeatureExtractor(model)
+>>> features = extractor('https://github.com/open-mmlab/mmpretrain/raw/main/demo/demo.JPEG')[0]
+>>> features[0].shape, features[1].shape, features[2].shape, features[3].shape
+(torch.Size([256]), torch.Size([512]), torch.Size([1024]), torch.Size([2048]))
+```
diff --git a/docs/en/user_guides/test.md b/docs/en/user_guides/test.md
new file mode 100644
index 0000000000000000000000000000000000000000..65ec073ea96762a0e5c6c850b7bdbd3fd3e67dac
--- /dev/null
+++ b/docs/en/user_guides/test.md
@@ -0,0 +1,123 @@
+# Test
+
+For image classification and image retrieval tasks, you can test your model after training.
+
+## Test with your PC
+
+You can use `tools/test.py` to test a model on a single machine with a CPU and optionally a GPU.
+
+Here is the full usage of the script:
+
+```shell
+python tools/test.py ${CONFIG_FILE} ${CHECKPOINT_FILE} [ARGS]
+```
+
+````{note}
+By default, MMPretrain prefers GPU to CPU. If you want to test a model on CPU, please empty `CUDA_VISIBLE_DEVICES` or set it to -1 to make GPU invisible to the program.
+
+```bash
+CUDA_VISIBLE_DEVICES=-1 python tools/test.py ${CONFIG_FILE} ${CHECKPOINT_FILE} [ARGS]
+```
+````
+
+| ARGS | Description |
+| ------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| `CONFIG_FILE` | The path to the config file. |
+| `CHECKPOINT_FILE` | The path to the checkpoint file (It can be a http link, and you can find checkpoints [here](https://mmpretrain.readthedocs.io/en/latest/modelzoo_statistics.html)). |
+| `--work-dir WORK_DIR` | The directory to save the file containing evaluation metrics. |
+| `--out OUT` | The path to save the file containing test results. |
+| `--out-item OUT_ITEM` | To specify the content of the test results file, and it can be "pred" or "metrics". If "pred", save the outputs of the model for offline evaluation. If "metrics", save the evaluation metrics. Defaults to "pred". |
+| `--cfg-options CFG_OPTIONS` | Override some settings in the used config, the key-value pair in xxx=yyy format will be merged into the config file. If the value to be overwritten is a list, it should be of the form of either `key="[a,b]"` or `key=a,b`. The argument also allows nested list/tuple values, e.g. `key="[(a,b),(c,d)]"`. Note that the quotation marks are necessary and that no white space is allowed. |
+| `--show-dir SHOW_DIR` | The directory to save the result visualization images. |
+| `--show` | Visualize the prediction result in a window. |
+| `--interval INTERVAL` | The interval of samples to visualize. |
+| `--wait-time WAIT_TIME` | The display time of every window (in seconds). Defaults to 1. |
+| `--no-pin-memory` | Whether to disable the `pin_memory` option in dataloaders. |
+| `--tta` | Whether to enable the Test-Time-Aug (TTA). If the config file has `tta_pipeline` and `tta_model` fields, use them to determine the TTA transforms and how to merge the TTA results. Otherwise, use flip TTA by averaging classification score. |
+| `--launcher {none,pytorch,slurm,mpi}` | Options for job launcher. |
+
+## Test with multiple GPUs
+
+We provide a shell script to start a multi-GPU task with `torch.distributed.launch`.
+
+```shell
+bash ./tools/dist_test.sh ${CONFIG_FILE} ${CHECKPOINT_FILE} ${GPU_NUM} [PY_ARGS]
+```
+
+| ARGS | Description |
+| ----------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| `CONFIG_FILE` | The path to the config file. |
+| `CHECKPOINT_FILE` | The path to the checkpoint file (It can be a http link, and you can find checkpoints [here](https://mmpretrain.readthedocs.io/en/latest/modelzoo_statistics.html)). |
+| `GPU_NUM` | The number of GPUs to be used. |
+| `[PY_ARGS]` | The other optional arguments of `tools/test.py`, see [here](#test-with-your-pc). |
+
+You can also specify extra arguments of the launcher by environment variables. For example, change the
+communication port of the launcher to 29666 by the below command:
+
+```shell
+PORT=29666 bash ./tools/dist_test.sh ${CONFIG_FILE} ${CHECKPOINT_FILE} ${GPU_NUM} [PY_ARGS]
+```
+
+If you want to start up multiple test jobs and use different GPUs, you can launch them by specifying
+different ports and visible devices.
+
+```shell
+CUDA_VISIBLE_DEVICES=0,1,2,3 PORT=29500 bash ./tools/dist_test.sh ${CONFIG_FILE1} ${CHECKPOINT_FILE} 4 [PY_ARGS]
+CUDA_VISIBLE_DEVICES=4,5,6,7 PORT=29501 bash ./tools/dist_test.sh ${CONFIG_FILE2} ${CHECKPOINT_FILE} 4 [PY_ARGS]
+```
+
+## Test with multiple machines
+
+### Multiple machines in the same network
+
+If you launch a test job with multiple machines connected with ethernet, you can run the following commands:
+
+On the first machine:
+
+```shell
+NNODES=2 NODE_RANK=0 PORT=$MASTER_PORT MASTER_ADDR=$MASTER_ADDR bash tools/dist_test.sh $CONFIG $CHECKPOINT_FILE $GPUS
+```
+
+On the second machine:
+
+```shell
+NNODES=2 NODE_RANK=1 PORT=$MASTER_PORT MASTER_ADDR=$MASTER_ADDR bash tools/dist_test.sh $CONFIG $CHECKPOINT_FILE $GPUS
+```
+
+Compared with using multiple GPUs on a single machine, you need to specify some extra environment variables:
+
+| ENV_VARS | Description |
+| ------------- | ---------------------------------------------------------------------------- |
+| `NNODES` | The total number of machines. |
+| `NODE_RANK` | The index of the local machine. |
+| `PORT` | The communication port, it should be the same in all machines. |
+| `MASTER_ADDR` | The IP address of the master machine, it should be the same in all machines. |
+
+It is usually slow if you do not have high-speed networking like InfiniBand.
+
+### Multiple machines managed with slurm
+
+If you run MMPretrain on a cluster managed with [slurm](https://slurm.schedmd.com/), you can use the script `tools/slurm_test.sh`.
+
+```shell
+[ENV_VARS] ./tools/slurm_test.sh ${PARTITION} ${JOB_NAME} ${CONFIG_FILE} ${CHECKPOINT_FILE} [PY_ARGS]
+```
+
+Here are the descriptions of the script arguments.
+
+| ARGS | Description |
+| ----------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| `PARTITION` | The partition to use in your cluster. |
+| `JOB_NAME` | The name of your job, you can name it as you like. |
+| `CONFIG_FILE` | The path to the config file. |
+| `CHECKPOINT_FILE` | The path to the checkpoint file (It can be a http link, and you can find checkpoints [here](https://mmpretrain.readthedocs.io/en/latest/modelzoo_statistics.html)). |
+| `[PY_ARGS]` | The other optional arguments of `tools/test.py`, see [here](#test-with-your-pc). |
+
+Here are the environment variables that can be used to configure the slurm job.
+
+| ENV_VARS | Description |
+| --------------- | ---------------------------------------------------------------------------------------------------------- |
+| `GPUS` | The number of GPUs to be used. Defaults to 8. |
+| `GPUS_PER_NODE` | The number of GPUs to be allocated per node. |
+| `CPUS_PER_TASK` | The number of CPUs to be allocated per task (Usually one GPU corresponds to one task). Defaults to 5. |
+| `SRUN_ARGS` | The other arguments of `srun`. Available options can be found [here](https://slurm.schedmd.com/srun.html). |
diff --git a/docs/en/user_guides/train.md b/docs/en/user_guides/train.md
new file mode 100644
index 0000000000000000000000000000000000000000..9cc618b038b4c44e46904ccca5c80731653ab1fc
--- /dev/null
+++ b/docs/en/user_guides/train.md
@@ -0,0 +1,121 @@
+# Train
+
+In this tutorial, we will introduce how to use the scripts provided in MMPretrain to start a training task. If
+needed, we also have some practical examples about [how to pretrain with a custom dataset](../notes/pretrain_custom_dataset.md)
+and [how to finetune with a custom dataset](../notes/finetune_custom_dataset.md).
+
+## Train with your PC
+
+You can use `tools/train.py` to train a model on a single machine with a CPU and optionally a GPU.
+
+Here is the full usage of the script:
+
+```shell
+python tools/train.py ${CONFIG_FILE} [ARGS]
+```
+
+````{note}
+By default, MMPretrain prefers GPU to CPU. If you want to train a model on CPU, please empty `CUDA_VISIBLE_DEVICES` or set it to -1 to make GPU invisible to the program.
+
+```bash
+CUDA_VISIBLE_DEVICES=-1 python tools/train.py ${CONFIG_FILE} [ARGS]
+```
+````
+
+| ARGS | Description |
+| ------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| `CONFIG_FILE` | The path to the config file. |
+| `--work-dir WORK_DIR` | The target folder to save logs and checkpoints. Defaults to a folder with the same name of the config file under `./work_dirs`. |
+| `--resume [RESUME]`                   | Resume training. If a path is specified, resume from it; otherwise, try to auto-resume from the latest checkpoint in the work directory.                             |
+| `--amp` | Enable automatic-mixed-precision training. |
+| `--no-validate` | **Not suggested**. Disable checkpoint evaluation during training. |
+| `--auto-scale-lr` | Auto scale the learning rate according to the actual batch size and the original batch size. |
+| `--no-pin-memory` | Whether to disable the `pin_memory` option in dataloaders. |
+| `--no-persistent-workers` | Whether to disable the `persistent_workers` option in dataloaders. |
+| `--cfg-options CFG_OPTIONS` | Override some settings in the used config, the key-value pair in xxx=yyy format will be merged into the config file. If the value to be overwritten is a list, it should be of the form of either `key="[a,b]"` or `key=a,b`. The argument also allows nested list/tuple values, e.g. `key="[(a,b),(c,d)]"`. Note that the quotation marks are necessary and that no white space is allowed. |
+| `--launcher {none,pytorch,slurm,mpi}` | Options for job launcher. |
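+
+For example, you can train a ResNet-50 on ImageNet with automatic mixed precision and a custom work directory as below (the paths here are only illustrative):
+
+```shell
+python tools/train.py configs/resnet/resnet50_8xb32_in1k.py \
+    --work-dir ./work_dirs/resnet50_amp \
+    --amp \
+    --cfg-options train_dataloader.batch_size=64
+```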
+
+## Train with multiple GPUs
+
+We provide a shell script to start a multi-GPU task with `torch.distributed.launch`.
+
+```shell
+bash ./tools/dist_train.sh ${CONFIG_FILE} ${GPU_NUM} [PY_ARGS]
+```
+
+| ARGS | Description |
+| ------------- | ---------------------------------------------------------------------------------- |
+| `CONFIG_FILE` | The path to the config file. |
+| `GPU_NUM` | The number of GPUs to be used. |
+| `[PY_ARGS]` | The other optional arguments of `tools/train.py`, see [here](#train-with-your-pc). |
+
+You can also specify extra arguments of the launcher by environment variables. For example, change the
+communication port of the launcher to 29666 with the command below:
+
+```shell
+PORT=29666 bash ./tools/dist_train.sh ${CONFIG_FILE} ${GPU_NUM} [PY_ARGS]
+```
+
+If you want to start multiple training jobs on different GPUs, you can launch them by specifying
+different ports and visible devices.
+
+```shell
+CUDA_VISIBLE_DEVICES=0,1,2,3 PORT=29500 bash ./tools/dist_train.sh ${CONFIG_FILE1} 4 [PY_ARGS]
+CUDA_VISIBLE_DEVICES=4,5,6,7 PORT=29501 bash ./tools/dist_train.sh ${CONFIG_FILE2} 4 [PY_ARGS]
+```
+
+## Train with multiple machines
+
+### Multiple machines in the same network
+
+If you launch a training job with multiple machines connected via Ethernet, you can run the following commands:
+
+On the first machine:
+
+```shell
+NNODES=2 NODE_RANK=0 PORT=$MASTER_PORT MASTER_ADDR=$MASTER_ADDR bash tools/dist_train.sh $CONFIG $GPUS
+```
+
+On the second machine:
+
+```shell
+NNODES=2 NODE_RANK=1 PORT=$MASTER_PORT MASTER_ADDR=$MASTER_ADDR bash tools/dist_train.sh $CONFIG $GPUS
+```
+
+Compared with multi-GPU training on a single machine, you need to specify some extra environment variables:
+
+| ENV_VARS | Description |
+| ------------- | ---------------------------------------------------------------------------- |
+| `NNODES` | The total number of machines. |
+| `NODE_RANK` | The index of the local machine. |
+| `PORT`        | The communication port; it should be the same on all machines.               |
+| `MASTER_ADDR` | The IP address of the master machine; it should be the same on all machines. |
+
+Training across multiple machines is usually slow if you do not have high-speed networking such as InfiniBand.
+
+### Multiple machines managed with slurm
+
+If you run MMPretrain on a cluster managed with [slurm](https://slurm.schedmd.com/), you can use the script `tools/slurm_train.sh`.
+
+```shell
+[ENV_VARS] ./tools/slurm_train.sh ${PARTITION} ${JOB_NAME} ${CONFIG_FILE} ${WORK_DIR} [PY_ARGS]
+```
+
+Here is a description of the script's arguments.
+
+| ARGS | Description |
+| ------------- | ---------------------------------------------------------------------------------- |
+| `PARTITION` | The partition to use in your cluster. |
+| `JOB_NAME` | The name of your job, you can name it as you like. |
+| `CONFIG_FILE` | The path to the config file. |
+| `WORK_DIR` | The target folder to save logs and checkpoints. |
+| `[PY_ARGS]` | The other optional arguments of `tools/train.py`, see [here](#train-with-your-pc). |
+
+Here are the environment variables that can be used to configure the slurm job.
+
+| ENV_VARS | Description |
+| --------------- | ---------------------------------------------------------------------------------------------------------- |
+| `GPUS` | The number of GPUs to be used. Defaults to 8. |
+| `GPUS_PER_NODE` | The number of GPUs to be allocated per node.                                                                 |
+| `CPUS_PER_TASK` | The number of CPUs to be allocated per task (Usually one GPU corresponds to one task). Defaults to 5. |
+| `SRUN_ARGS` | The other arguments of `srun`. Available options can be found [here](https://slurm.schedmd.com/srun.html). |
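+
+For example, the following command submits a 16-GPU training job (the partition and job names are only placeholders):
+
+```shell
+GPUS=16 GPUS_PER_NODE=8 CPUS_PER_TASK=5 ./tools/slurm_train.sh my-partition train-resnet configs/resnet/resnet50_8xb32_in1k.py ./work_dirs/resnet50
+```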
diff --git a/docs/zh_CN/Makefile b/docs/zh_CN/Makefile
new file mode 100644
index 0000000000000000000000000000000000000000..d4bb2cbb9eddb1bb1b4f366623044af8e4830919
--- /dev/null
+++ b/docs/zh_CN/Makefile
@@ -0,0 +1,20 @@
+# Minimal makefile for Sphinx documentation
+#
+
+# You can set these variables from the command line, and also
+# from the environment for the first two.
+SPHINXOPTS ?=
+SPHINXBUILD ?= sphinx-build
+SOURCEDIR = .
+BUILDDIR = _build
+
+# Put it first so that "make" without argument is like "make help".
+help:
+ @$(SPHINXBUILD) -M help "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)
+
+.PHONY: help Makefile
+
+# Catch-all target: route all unknown targets to Sphinx using the new
+# "make mode" option. $(O) is meant as a shortcut for $(SPHINXOPTS).
+%: Makefile
+ @$(SPHINXBUILD) -M $@ "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)
diff --git a/docs/zh_CN/_static/css/readthedocs.css b/docs/zh_CN/_static/css/readthedocs.css
new file mode 100644
index 0000000000000000000000000000000000000000..39dc689e8a97b22a48e9d6badbb729faa4335d3c
--- /dev/null
+++ b/docs/zh_CN/_static/css/readthedocs.css
@@ -0,0 +1,61 @@
+.header-logo {
+ background-image: url("../image/mmpt-logo.png");
+ background-size: 183px 50px;
+ height: 50px;
+ width: 183px;
+}
+
+@media screen and (min-width: 1100px) {
+ .header-logo {
+ top: -12px;
+ }
+}
+
+pre {
+ white-space: pre;
+}
+
+@media screen and (min-width: 2000px) {
+ .pytorch-content-left {
+ width: 1200px;
+ margin-left: 30px;
+ }
+ article.pytorch-article {
+ max-width: 1200px;
+ }
+ .pytorch-breadcrumbs-wrapper {
+ width: 1200px;
+ }
+ .pytorch-right-menu.scrolling-fixed {
+ position: fixed;
+ top: 45px;
+ left: 1580px;
+ }
+}
+
+article.pytorch-article section code {
+ padding: .2em .4em;
+ background-color: #f3f4f7;
+ border-radius: 5px;
+}
+
+/* Disable the change in tables */
+article.pytorch-article section table code {
+ padding: unset;
+ background-color: unset;
+ border-radius: unset;
+}
+
+table.autosummary td {
+ width: 50%
+}
+
+img.align-center {
+ display: block;
+ margin-left: auto;
+ margin-right: auto;
+}
+
+article.pytorch-article p.rubric {
+ font-weight: bold;
+}
diff --git a/docs/zh_CN/_static/image/confusion-matrix.png b/docs/zh_CN/_static/image/confusion-matrix.png
new file mode 120000
index 0000000000000000000000000000000000000000..7b0b377272ca60968b14e3b30e5cb8545f13534b
--- /dev/null
+++ b/docs/zh_CN/_static/image/confusion-matrix.png
@@ -0,0 +1 @@
+../../../en/_static/image/confusion-matrix.png
\ No newline at end of file
diff --git a/docs/zh_CN/_static/image/mmpt-logo.png b/docs/zh_CN/_static/image/mmpt-logo.png
new file mode 100644
index 0000000000000000000000000000000000000000..f4e060716520ece5db7e85df3c3ad8fd9e0eda57
Binary files /dev/null and b/docs/zh_CN/_static/image/mmpt-logo.png differ
diff --git a/docs/zh_CN/_static/image/tools/analysis/analyze_log.jpg b/docs/zh_CN/_static/image/tools/analysis/analyze_log.jpg
new file mode 100644
index 0000000000000000000000000000000000000000..8eb1a27d6464d255b84b23a7460a5f622f51712f
Binary files /dev/null and b/docs/zh_CN/_static/image/tools/analysis/analyze_log.jpg differ
diff --git a/docs/zh_CN/_static/js/custom.js b/docs/zh_CN/_static/js/custom.js
new file mode 100644
index 0000000000000000000000000000000000000000..96f0679385f616f29f7d7106f0507a5f120019be
--- /dev/null
+++ b/docs/zh_CN/_static/js/custom.js
@@ -0,0 +1,20 @@
+var collapsedSections = ['进阶教程', '模型库', '可视化', '分析工具', '部署', '其他说明'];
+
+$(document).ready(function () {
+ $('.model-summary').DataTable({
+ "stateSave": false,
+ "lengthChange": false,
+ "pageLength": 20,
+ "order": [],
+ "language": {
+ "info": "显示 _START_ 至 _END_ 条目(总计 _TOTAL_ )",
+ "infoFiltered": "(筛选自 _MAX_ 条目)",
+ "search": "搜索:",
+ "zeroRecords": "没有找到任何条目",
+ "paginate": {
+ "next": "下一页",
+ "previous": "上一页"
+ },
+ }
+ });
+});
diff --git a/docs/zh_CN/_templates/404.html b/docs/zh_CN/_templates/404.html
new file mode 100644
index 0000000000000000000000000000000000000000..abf3356cf4413269b82439f28b6884fc8e51376f
--- /dev/null
+++ b/docs/zh_CN/_templates/404.html
@@ -0,0 +1,16 @@
+{% extends "layout.html" %}
+
+{% block body %}
+
+<h1>未找到页面</h1>
+
+<p>
+  未找到你要打开的页面。
+</p>
+
+<p>
+  如果你是从旧版本文档跳转至此,可能是对应的页面被移动了。请从左侧的目录中寻找新版本文档,或者跳转至首页。
+</p>
+
+<p>
+  如果你找不到希望打开的文档,欢迎在 Issue 中告诉我们!
+</p>
+
+{% endblock %}
diff --git a/docs/zh_CN/_templates/autosummary/class.rst b/docs/zh_CN/_templates/autosummary/class.rst
new file mode 100644
index 0000000000000000000000000000000000000000..4c3a7a9abf5c5b14ac3ef3b00a2f070480295358
--- /dev/null
+++ b/docs/zh_CN/_templates/autosummary/class.rst
@@ -0,0 +1,13 @@
+.. role:: hidden
+ :class: hidden-section
+.. currentmodule:: {{ module }}
+
+
+{{ name | underline}}
+
+.. autoclass:: {{ name }}
+ :members:
+
+..
+ autogenerated from _templates/autosummary/class.rst
+ note it does not have :inherited-members:
diff --git a/docs/zh_CN/_templates/callable.rst b/docs/zh_CN/_templates/callable.rst
new file mode 100644
index 0000000000000000000000000000000000000000..3a7b9d2b96c76dfa3eb1d8bef56f58f219fe7760
--- /dev/null
+++ b/docs/zh_CN/_templates/callable.rst
@@ -0,0 +1,14 @@
+.. role:: hidden
+ :class: hidden-section
+.. currentmodule:: {{ module }}
+
+
+{{ name | underline}}
+
+.. autoclass:: {{ name }}
+ :members:
+ :special-members: __call__
+
+..
+ autogenerated from _templates/callable.rst
+ note it does not have :inherited-members:
diff --git a/docs/zh_CN/_templates/data_transform.rst b/docs/zh_CN/_templates/data_transform.rst
new file mode 100644
index 0000000000000000000000000000000000000000..376bfe9db6c305e681f265dd0e20b7b7ea6e500f
--- /dev/null
+++ b/docs/zh_CN/_templates/data_transform.rst
@@ -0,0 +1,13 @@
+.. role:: hidden
+ :class: hidden-section
+.. currentmodule:: {{ module }}
+
+
+{{ name | underline}}
+
+.. autoclass:: {{ name }}
+ :members: transform
+
+..
+ autogenerated from _templates/callable.rst
+ note it does not have :inherited-members:
diff --git a/docs/zh_CN/advanced_guides/convention.md b/docs/zh_CN/advanced_guides/convention.md
new file mode 100644
index 0000000000000000000000000000000000000000..941236b698bf0861d1547227c7671c76a59e3075
--- /dev/null
+++ b/docs/zh_CN/advanced_guides/convention.md
@@ -0,0 +1,114 @@
+# MMPretrain 中的约定
+
+## 模型命名规则
+
+MMPretrain 按照以下风格进行模型命名,代码库的贡献者需要遵循相同的命名规则。模型名总体分为五个部分:算法信息,模块信息,预训练信息,训练信息和数据信息。逻辑上属于不同部分的单词之间用下划线 `'_'` 连接,同一部分有多个单词用短横线 `'-'` 连接。
+
+```text
+{algorithm info}_{module info}_{pretrain info}_{training info}_{data info}
+```
+
+- `algorithm info`(可选):算法信息,表示用以训练该模型的主要算法,如 MAE、BEiT 等
+- `module info`:模块信息,主要包含模型的主干网络名称,如 resnet、vit 等
+- `pretrain info`(可选):预训练信息,比如预训练模型是在 ImageNet-21k 数据集上训练的等
+- `training info`:训练信息,训练策略设置,包括 batch size,schedule 以及数据增强等;
+- `data info`:数据信息,数据集名称、模态、输入尺寸等,如 imagenet, cifar 等;
+
+### 算法信息
+
+指用以训练该模型的算法名称,例如:
+
+- `simclr`
+- `mocov2`
+- `eva-mae-style`
+
+使用监督图像分类任务训练的模型可以省略这个字段。
+
+### 模块信息
+
+指模型的结构信息,一般主要包含模型的主干网络结构,`neck` 和 `head` 信息一般被省略。例如:
+
+- `resnet50`
+- `vit-base-p16`
+- `swin-base`
+
+### 预训练信息
+
+如果该模型是在预训练模型基础上,通过微调获得的,我们需要记录预训练模型的一些信息。例如:
+
+- 预训练模型的来源:`fb`、`openai`等。
+- 训练预训练模型的方法:`clip`、`mae`、`distill` 等。
+- 用于预训练的数据集:`in21k`、`laion2b`等(`in1k`可以省略)
+- 训练时长:`300e`、`1600e` 等。
+
+并非所有信息都是必要的,只需要选择用以区分不同的预训练模型的信息即可。
+
+在此字段的末尾,使用 `-pre` 作为标识符,例如 `mae-in21k-pre`。
+
+### 训练信息
+
+训练策略的一些设置,包括训练类型、 `batch size`、 `lr schedule`、 数据增强以及特殊的损失函数等等,比如:
+Batch size 信息:
+
+- 格式为`{gpu x batch_per_gpu}`, 如 `8xb32`
+
+训练类型(主要见于 transformer 网络,如 `ViT` 算法,这类算法通常分为预训练和微调两种模式):
+
+- `ft` : Finetune config,用于微调的配置文件
+- `pt` : Pretrain config,用于预训练的配置文件
+
+训练策略信息,训练策略以复现配置文件为基础,此基础不必标注训练策略。但如果在此基础上进行改进,则需注明训练策略,按照应用点位顺序排列,如:`{pipeline aug}-{train aug}-{loss trick}-{scheduler}-{epochs}`
+
+- `coslr-200e` : 使用 cosine scheduler, 训练 200 个 epoch
+- `autoaug-mixup-lbs-coslr-50e` : 使用了 `autoaug`、`mixup`、`label smooth`、`cosine scheduler`, 训练了 50 个轮次
+
+如果模型是从官方仓库等第三方仓库转换过来的,训练信息可以省略,使用 `3rdparty` 作为标识符。
+
+### 数据信息
+
+- `in1k` : `ImageNet1k` 数据集,默认使用 `224x224` 大小的图片
+- `in21k` : `ImageNet21k` 数据集,有些地方也称为 `ImageNet22k` 数据集,默认使用 `224x224` 大小的图片
+- `in1k-384px` : 表示训练的输入图片大小为 `384x384`
+- `cifar100`
+
+### 模型命名案例
+
+```text
+vit-base-p32_clip-openai-pre_3rdparty_in1k
+```
+
+- `vit-base-p32`: 模块信息
+- `clip-openai-pre`:预训练信息
+ - `clip`:预训练方法是 clip
+ - `openai`:预训练模型来自 OpenAI
+ - `pre`:预训练标识符
+- `3rdparty`:模型是从第三方仓库转换而来的
+- `in1k`:数据集信息。该模型是从 ImageNet-1k 数据集训练而来的,输入大小为 `224x224`
+
+```text
+beit_beit-base-p16_8xb256-amp-coslr-300e_in1k
+```
+
+- `beit`: 算法信息
+- `beit-base`:模块信息,由于主干网络来自 BEiT 中提出的修改版 ViT,主干网络名称也是 `beit`
+- `8xb256-amp-coslr-300e`:训练信息
+ - `8xb256`:使用 8 个 GPU,每个 GPU 的批量大小为 256
+ - `amp`:使用自动混合精度训练
+ - `coslr`:使用余弦退火学习率调度器
+ - `300e`:训练 300 个 epoch
+- `in1k`:数据集信息。该模型是从 ImageNet-1k 数据集训练而来的,输入大小为 `224x224`
+
+## 配置文件命名规则
+
+配置文件的命名与模型名称几乎相同,有几点不同:
+
+- 训练信息是必要的,不能是 `3rdparty`
+- 如果配置文件只包含主干网络设置,既没有头部设置也没有数据集设置,我们将其命名为`{module info}_headless.py`。这种配置文件通常用于大型数据集上的第三方预训练模型。
+
+### 权重命名规则
+
+权重的命名主要包括模型名称,日期和哈希值。
+
+```text
+{model_name}_{date}-{hash}.pth
+```
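+
+例如,一个符合上述规则的权重文件名可能形如下面这样(其中的日期与哈希值仅为示意):
+
+```text
+resnet50_8xb32_in1k_20230101-0123abcd.pth
+```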
diff --git a/docs/zh_CN/advanced_guides/datasets.md b/docs/zh_CN/advanced_guides/datasets.md
new file mode 100644
index 0000000000000000000000000000000000000000..83b7959b9f136e0938c89fe6f171f33c2eedde35
--- /dev/null
+++ b/docs/zh_CN/advanced_guides/datasets.md
@@ -0,0 +1,73 @@
+# 添加新数据集
+
+用户可以编写一个继承自 [BaseDataset](https://mmpretrain.readthedocs.io/zh_CN/latest/_modules/mmpretrain/datasets/base_dataset.html#BaseDataset) 的新数据集类,并重载 `load_data_list(self)` 方法,类似 [CIFAR10](https://github.com/open-mmlab/mmpretrain/blob/main/mmpretrain/datasets/cifar.py) 和 [ImageNet](https://github.com/open-mmlab/mmpretrain/blob/main/mmpretrain/datasets/imagenet.py)。
+
+通常,此方法返回一个包含所有样本的列表,其中的每个样本都是一个字典。字典中包含了必要的数据信息,例如 `img` 和 `gt_label`。
+
+假设我们将要实现一个 `Filelist` 数据集,该数据集将使用文件列表进行训练和测试。注释列表的格式如下:
+
+```text
+000001.jpg 0
+000002.jpg 1
+...
+```
+
+## 1. 创建数据集类
+
+我们可以在 `mmpretrain/datasets/filelist.py` 中创建一个新的数据集类以加载数据。
+
+```python
+import os.path as osp
+
+from mmpretrain.registry import DATASETS
+from .base_dataset import BaseDataset
+
+
+@DATASETS.register_module()
+class Filelist(BaseDataset):
+
+ def load_data_list(self):
+ assert isinstance(self.ann_file, str)
+
+ data_list = []
+ with open(self.ann_file) as f:
+ samples = [x.strip().split(' ') for x in f.readlines()]
+ for filename, gt_label in samples:
+                # 拼接数据前缀与文件名得到完整路径
+                # (这里假设 `self.img_prefix` 指向图片所在目录)
+                img_path = osp.join(self.img_prefix, filename)
+ info = {'img_path': img_path, 'gt_label': int(gt_label)}
+ data_list.append(info)
+ return data_list
+```
+
+## 2. 添加到库
+
+将新的数据集类加入到 `mmpretrain/datasets/__init__.py` 中:
+
+```python
+from .base_dataset import BaseDataset
+...
+from .filelist import Filelist
+
+__all__ = [
+ 'BaseDataset', ... ,'Filelist'
+]
+```
+
+## 3. 修改相关配置文件
+
+然后在配置文件中,为了使用 `Filelist`,用户可以按以下方式修改配置
+
+```python
+train_dataloader = dict(
+ ...
+ dataset=dict(
+ type='Filelist',
+ ann_file='image_list.txt',
+ pipeline=train_pipeline,
+ )
+)
+```
+
+所有继承 [`BaseDataset`](https://github.com/open-mmlab/mmpretrain/blob/main/mmpretrain/datasets/base_dataset.py) 的数据集类都具有**懒加载**以及**节省内存**的特性,可以参考相关文档 {external+mmengine:doc}`BaseDataset `。
+
+```{note}
+如果获取数据样本时的字典中只包含了 'img_path' 而不包含 'img',则 pipeline 中必须包含 'LoadImageFromFile'。
+```
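+
+下面给出一个包含 `LoadImageFromFile` 的最简数据流水线示例,仅作示意,实际使用时可按需添加其他数据变换:
+
+```python
+train_pipeline = [
+    dict(type='LoadImageFromFile'),             # 根据 'img_path' 读取图像到 'img' 字段
+    dict(type='RandomResizedCrop', scale=224),  # 随机缩放裁剪
+    dict(type='PackInputs'),                    # 打包为模型输入
+]
+```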
diff --git a/docs/zh_CN/advanced_guides/evaluation.md b/docs/zh_CN/advanced_guides/evaluation.md
new file mode 100644
index 0000000000000000000000000000000000000000..32db19750458a1b297a8d444df00df76699bd5ef
--- /dev/null
+++ b/docs/zh_CN/advanced_guides/evaluation.md
@@ -0,0 +1,97 @@
+# 自定义评估指标
+
+## 使用 MMPretrain 中的指标
+
+在 MMPretrain 中,我们为单标签分类和多标签分类提供了多种指标:
+
+**单标签分类**:
+
+- [`Accuracy`](mmpretrain.evaluation.Accuracy)
+- [`SingleLabelMetric`](mmpretrain.evaluation.SingleLabelMetric),包括精度、召回率、f1-score 和支持度。
+
+**多标签分类**:
+
+- [`AveragePrecision`](mmpretrain.evaluation.AveragePrecision), 或 AP (mAP)。
+- [`MultiLabelMetric`](mmpretrain.evaluation.MultiLabelMetric),包括精度、召回率、f1-score 和支持度。
+
+要在验证和测试期间使用这些指标,我们需要修改配置文件中的 `val_evaluator` 和 `test_evaluator` 字段。
+
+以下为几个例子:
+
+1. 在验证和测试期间计算 top-1 和 top-5 准确率。
+
+ ```python
+ val_evaluator = dict(type='Accuracy', topk=(1, 5))
+ test_evaluator = val_evaluator
+ ```
+
+2. 在验证和测试期间计算 top-1 准确率、top-5 准确度、精确度和召回率。
+
+ ```python
+ val_evaluator = [
+ dict(type='Accuracy', topk=(1, 5)),
+ dict(type='SingleLabelMetric', items=['precision', 'recall']),
+ ]
+ test_evaluator = val_evaluator
+ ```
+
+3. 计算 mAP(平均精度均值)、CP(类别平均精度)、CR(类别平均召回率)、CF(类别平均 F1 分数)、OP(总体平均精度)、OR(总体平均召回率)和 OF1(总体平均 F1 分数)。
+
+ ```python
+ val_evaluator = [
+ dict(type='AveragePrecision'),
+ dict(type='MultiLabelMetric', average='macro'), # class-wise mean
+ dict(type='MultiLabelMetric', average='micro'), # overall mean
+ ]
+ test_evaluator = val_evaluator
+ ```
+
+## 添加新的指标
+
+MMPretrain 支持为追求更高定制化的用户实现定制化的评估指标。
+
+您需要在 `mmpretrain/evaluation/metrics` 下创建一个新文件,并在该文件中实现新的指标,例如,在 `mmpretrain/evaluation/metrics/my_metric.py` 中。并创建一个自定义的评估指标类 `MyMetric` 继承 [MMEngine 中的 BaseMetric](mmengine.evaluator.BaseMetric)。
+
+需要分别重写数据格式处理方法 `process` 和指标计算方法 `compute_metrics`,并将该类注册到 `METRICS` 注册器中,以实现自定义评估指标。
+
+```python
+from typing import Dict, List, Sequence
+
+from mmengine.evaluator import BaseMetric
+from mmpretrain.registry import METRICS
+
+@METRICS.register_module()
+class MyMetric(BaseMetric):
+
+ def process(self, data_batch: Sequence[Dict], data_samples: Sequence[Dict]):
+ """ The processed results should be stored in ``self.results``, which will
+ be used to computed the metrics when all batches have been processed.
+ `data_batch` stores the batch data from dataloader,
+ and `data_samples` stores the batch outputs from model.
+ """
+ ...
+
+ def compute_metrics(self, results: List):
+ """ Compute the metrics from processed results and returns the evaluation results.
+ """
+ ...
+```
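+
+作为参考,下面是一个统计 top-1 正确率的简化示意实现。这里假设 `data_samples` 中的每个字典都包含 `pred_label` 和 `gt_label` 字段,类名 `SimpleAccuracy` 也仅为示例,实际字段名请以你使用的数据结构为准:
+
+```python
+from mmengine.evaluator import BaseMetric
+from mmpretrain.registry import METRICS
+
+
+@METRICS.register_module()
+class SimpleAccuracy(BaseMetric):
+
+    def process(self, data_batch, data_samples):
+        for data_sample in data_samples:
+            # 假设的字段名,仅作示意
+            pred = data_sample['pred_label']
+            gt = data_sample['gt_label']
+            # 将每个样本的判定结果暂存到 self.results 中
+            self.results.append({'correct': int(pred == gt)})
+
+    def compute_metrics(self, results):
+        # 汇总所有批次的结果,计算 top-1 正确率
+        correct = sum(r['correct'] for r in results)
+        return {'accuracy': 100.0 * correct / len(results)}
+```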
+
+然后,将其导入 `mmpretrain/evaluation/metrics/__init__.py` 以将其添加到 `mmpretrain.evaluation` 包中。
+
+```python
+# In mmpretrain/evaluation/metrics/__init__.py
+...
+from .my_metric import MyMetric
+
+__all__ = [..., 'MyMetric']
+```
+
+最后,在配置文件的 `val_evaluator` 和 `test_evaluator` 字段中使用 `MyMetric`。
+
+```python
+val_evaluator = dict(type='MyMetric', ...)
+test_evaluator = val_evaluator
+```
+
+```{note}
+更多的细节可以参考 {external+mmengine:doc}`MMEngine 文档: Evaluation `.
+```
diff --git a/docs/zh_CN/advanced_guides/modules.md b/docs/zh_CN/advanced_guides/modules.md
new file mode 100644
index 0000000000000000000000000000000000000000..cb0fac6a11a79a736a7e7290e8e107745bb98d57
--- /dev/null
+++ b/docs/zh_CN/advanced_guides/modules.md
@@ -0,0 +1,512 @@
+# 自定义模型
+
+在我们的设计中,一个完整的模型被定义为顶层模块,它根据功能的不同,由几种不同类型的模型组件组成。
+
+- 模型:顶层模块定义了具体的任务类型,例如 `ImageClassifier` 用在图像分类任务中, `MAE` 用在自监督学习中, `ImageToImageRetriever` 用在图像检索中。
+- 主干网络:通常是一个特征提取网络,涵盖了模型之间绝大多数的差异,例如 `ResNet`、`MobileNet`。
+- 颈部:用于连接主干网络和头部的组件,例如 `GlobalAveragePooling`。
+- 头部:用于执行特定任务的组件,例如 `ClsHead`、 `ContrastiveHead`。
+- 损失函数:在头部用于计算损失函数的组件,例如 `CrossEntropyLoss`、`LabelSmoothLoss`。
+- 目标生成器: 用于自监督学习任务的组件,例如 `VQKD`、 `HOGGenerator`。
+
+## 添加新的顶层模型
+
+通常来说,图像分类和图像检索任务的顶层模型计算流程基本一致。但不同的自监督学习算法会使用不同的计算流程,像 `MAE` 和 `BEiT` 就大不相同。所以在这个部分,我们将简单介绍如何添加一个新的自监督学习算法。
+
+### 添加新的自监督学习算法
+
+1. 创建新文件 `mmpretrain/models/selfsup/new_algorithm.py` 以及实现 `NewAlgorithm`
+
+ ```python
+ from mmpretrain.registry import MODELS
+ from .base import BaseSelfSupvisor
+
+
+ @MODELS.register_module()
+ class NewAlgorithm(BaseSelfSupvisor):
+
+ def __init__(self, backbone, neck=None, head=None, init_cfg=None):
+ super().__init__(init_cfg)
+ pass
+
+ # ``extract_feat`` function is defined in BaseSelfSupvisor, you could
+ # overwrite it if needed
+ def extract_feat(self, inputs, **kwargs):
+ pass
+
+ # the core function to compute the loss
+ def loss(self, inputs, data_samples, **kwargs):
+ pass
+
+ ```
+
+2. 在 `mmpretrain/models/selfsup/__init__.py` 中导入对应的新算法
+
+ ```python
+ ...
+ from .new_algorithm import NewAlgorithm
+
+ __all__ = [
+ ...,
+ 'NewAlgorithm',
+ ...
+ ]
+ ```
+
+3. 在配置文件中使用新算法
+
+ ```python
+ model = dict(
+ type='NewAlgorithm',
+ backbone=...,
+ neck=...,
+ head=...,
+ ...
+ )
+ ```
+
+## 添加新的主干网络
+
+这里,我们以 `ResNet_CIFAR` 为例,展示了如何开发一个新的主干网络组件。
+
+`ResNet_CIFAR` 针对 CIFAR 32x32 的图像输入,远小于大多数模型使用的ImageNet默认的224x224输入配置,所以我们将骨干网络中 `kernel_size=7,stride=2`
+的设置替换为 `kernel_size=3, stride=1`,并移除了 stem 层之后的
+`MaxPooling`,以避免传递过小的特征图到残差块中。
+
+最简单的方式就是继承自 `ResNet` 并只修改 stem 层。
+
+1. 创建一个新文件 `mmpretrain/models/backbones/resnet_cifar.py`。
+
+ ```python
+    import torch.nn as nn
+    from mmcv.cnn import build_conv_layer, build_norm_layer
+
+    from mmpretrain.registry import MODELS
+    from .resnet import ResNet
+
+
+ @MODELS.register_module()
+ class ResNet_CIFAR(ResNet):
+
+ """ResNet backbone for CIFAR.
+
+ (对这个主干网络的简短描述)
+
+ Args:
+ depth(int): Network depth, from {18, 34, 50, 101, 152}.
+ ...
+ (参数文档)
+ """
+
+ def __init__(self, depth, deep_stem=False, **kwargs):
+ # 调用基类 ResNet 的初始化函数
+            super(ResNet_CIFAR, self).__init__(depth, deep_stem=deep_stem, **kwargs)
+ # 其他特殊的初始化流程
+ assert not self.deep_stem, 'ResNet_CIFAR do not support deep_stem'
+
+ def _make_stem_layer(self, in_channels, base_channels):
+ # 重载基类的方法,以实现对网络结构的修改
+ self.conv1 = build_conv_layer(
+ self.conv_cfg,
+ in_channels,
+ base_channels,
+ kernel_size=3,
+ stride=1,
+ padding=1,
+ bias=False)
+ self.norm1_name, norm1 = build_norm_layer(
+ self.norm_cfg, base_channels, postfix=1)
+ self.add_module(self.norm1_name, norm1)
+ self.relu = nn.ReLU(inplace=True)
+
+ def forward(self, x):
+ # 如果需要的话,可以自定义forward方法
+ x = self.conv1(x)
+ x = self.norm1(x)
+ x = self.relu(x)
+ outs = []
+ for i, layer_name in enumerate(self.res_layers):
+ res_layer = getattr(self, layer_name)
+ x = res_layer(x)
+ if i in self.out_indices:
+ outs.append(x)
+ # 输出值需要是一个包含不同层多尺度输出的元组
+ # 如果不需要多尺度特征,可以直接在最终输出上包一层元组
+ return tuple(outs)
+
+ def init_weights(self):
+ # 如果需要的话,可以自定义权重初始化的方法
+ super().init_weights()
+
+ # 如果有预训练模型,则不需要进行权重初始化
+ if self.init_cfg is not None and self.init_cfg['type'] == 'Pretrained':
+ return
+
+ # 通常来说,我们建议用`init_cfg`去列举不同层权重初始化方法
+ # 包括卷积层,线性层,归一化层等等
+ # 如果有特殊需要,可以在这里进行额外的初始化操作
+ ...
+ ```
+
+```{note}
+在 OpenMMLab 2.0 的设计中,将原有的`BACKBONES`、`NECKS`、`HEADS`、`LOSSES`等注册名统一为`MODELS`.
+```
+
+2. 在 `mmpretrain/models/backbones/__init__.py` 中导入新模块
+
+ ```python
+ ...
+ from .resnet_cifar import ResNet_CIFAR
+
+ __all__ = [
+ ..., 'ResNet_CIFAR'
+ ]
+ ```
+
+3. 在配置文件中使用新的主干网络
+
+ ```python
+ model = dict(
+ ...
+ backbone=dict(
+ type='ResNet_CIFAR',
+ depth=18,
+ other_arg=xxx),
+ ...
+ ```
+
+### 为自监督学习添加新的主干网络
+
+对于一部分自监督学习算法,主干网络做了一定修改,例如 `MAE`、`BEiT` 等。 这些主干网络需要处理 `mask` 相关的逻辑,以此从可见的图像块中提取对应的特征信息。
+
+以 [MAEViT](mmpretrain.models.selfsup.MAEViT) 作为例子,我们需要重写 `forward` 函数,进行基于 `mask` 的计算。我们实现了 `init_weights` 进行特定权重的初始化和 `random_masking` 函数来生成 `MAE` 预训练所需要的 `mask`。
+
+```python
+from typing import Optional, Tuple
+
+import torch
+
+# VisionTransformer 是 MMPretrain 中的 ViT 主干网络
+from mmpretrain.models import VisionTransformer
+
+
+class MAEViT(VisionTransformer):
+ """Vision Transformer for MAE pre-training"""
+
+    def __init__(self, mask_ratio, **kwargs) -> None:
+ super().__init__(**kwargs)
+ # position embedding is not learnable during pretraining
+ self.pos_embed.requires_grad = False
+ self.mask_ratio = mask_ratio
+ self.num_patches = self.patch_resolution[0] * self.patch_resolution[1]
+
+ def init_weights(self) -> None:
+ """Initialize position embedding, patch embedding and cls token."""
+ super().init_weights()
+ # define what if needed
+ pass
+
+ def random_masking(
+ self,
+ x: torch.Tensor,
+ mask_ratio: float = 0.75
+ ) -> Tuple[torch.Tensor, torch.Tensor, torch.Tensor]:
+ """Generate the mask for MAE Pre-training."""
+ pass
+
+ def forward(
+ self,
+ x: torch.Tensor,
+ mask: Optional[bool] = True
+ ) -> Tuple[torch.Tensor, torch.Tensor, torch.Tensor]:
+ """Generate features for masked images.
+
+ The function supports two kind of forward behaviors. If the ``mask`` is
+ ``True``, the function will generate mask to masking some patches
+ randomly and get the hidden features for visible patches, which means
+        the function will be executed as masked image modeling pre-training;
+ if the ``mask`` is ``None`` or ``False``, the forward function will
+ call ``super().forward()``, which extract features from images without
+ mask.
+ """
+        if mask is None or mask is False:
+ return super().forward(x)
+
+ else:
+ B = x.shape[0]
+ x = self.patch_embed(x)[0]
+ # add pos embed w/o cls token
+ x = x + self.pos_embed[:, 1:, :]
+
+ # masking: length -> length * mask_ratio
+ x, mask, ids_restore = self.random_masking(x, self.mask_ratio)
+
+ # append cls token
+ cls_token = self.cls_token + self.pos_embed[:, :1, :]
+ cls_tokens = cls_token.expand(B, -1, -1)
+ x = torch.cat((cls_tokens, x), dim=1)
+
+ for _, layer in enumerate(self.layers):
+ x = layer(x)
+ # Use final norm
+ x = self.norm1(x)
+
+ return (x, mask, ids_restore)
+
+```
+
+## 添加新的颈部组件
+
+这里我们以 `GlobalAveragePooling` 为例。这是一个非常简单的颈部组件,没有任何参数。
+
+要添加新的颈部组件,我们主要需要实现 `forward` 函数,该函数对主干网络的输出进行
+一些操作并将结果传递到头部。
+
+1. 创建一个新文件 `mmpretrain/models/necks/gap.py`
+
+ ```python
+ import torch.nn as nn
+
+ from mmpretrain.registry import MODELS
+
+ @MODELS.register_module()
+ class GlobalAveragePooling(nn.Module):
+
+        def __init__(self):
+            super().__init__()
+            self.gap = nn.AdaptiveAvgPool2d((1, 1))
+
+ def forward(self, inputs):
+ # 简单起见,我们默认输入是一个张量
+ outs = self.gap(inputs)
+ outs = outs.view(inputs.size(0), -1)
+ return outs
+ ```
+
+2. 在 `mmpretrain/models/necks/__init__.py` 中导入新模块
+
+ ```python
+ ...
+ from .gap import GlobalAveragePooling
+
+ __all__ = [
+ ..., 'GlobalAveragePooling'
+ ]
+ ```
+
+3. 修改配置文件以使用新的颈部组件
+
+ ```python
+ model = dict(
+ neck=dict(type='GlobalAveragePooling'),
+ )
+ ```
+
+## 添加新的头部组件
+
+### 基于分类头
+
+在此,我们以一个简化的 `VisionTransformerClsHead` 为例,说明如何开发新的头部组件。
+
+要添加一个新的头部组件,基本上我们需要实现 `pre_logits` 函数,用于完成进入最后的分类层之前所需的处理,
+以及 `forward` 函数。
+
+1. 创建一个文件 `mmpretrain/models/heads/vit_head.py`.
+
+ ```python
+ import torch.nn as nn
+
+ from mmpretrain.registry import MODELS
+ from .cls_head import ClsHead
+
+
+ @MODELS.register_module()
+    class VisionTransformerClsHead(ClsHead):
+
+ def __init__(self, num_classes, in_channels, hidden_dim, **kwargs):
+ super().__init__(**kwargs)
+ self.in_channels = in_channels
+ self.num_classes = num_classes
+ self.hidden_dim = hidden_dim
+
+ self.fc1 = nn.Linear(in_channels, hidden_dim)
+ self.act = nn.Tanh()
+ self.fc2 = nn.Linear(hidden_dim, num_classes)
+
+ def pre_logits(self, feats):
+ # 骨干网络的输出通常包含多尺度信息的元组
+ # 对于分类任务来说,我们只需要关注最后的输出
+ feat = feats[-1]
+
+ # VisionTransformer的最终输出是一个包含patch tokens和cls tokens的元组
+ # 这里我们只需要cls tokens
+ _, cls_token = feat
+
+ # 完成除了最后的线性分类头以外的操作
+ return self.act(self.fc1(cls_token))
+
+ def forward(self, feats):
+ pre_logits = self.pre_logits(feats)
+
+ # 完成最后的分类头
+            cls_score = self.fc2(pre_logits)
+ return cls_score
+ ```
+
+2. 在 `mmpretrain/models/heads/__init__.py` 中导入这个模块
+
+ ```python
+ ...
+ from .vit_head import VisionTransformerClsHead
+
+ __all__ = [
+ ..., 'VisionTransformerClsHead'
+ ]
+ ```
+
+3. 修改配置文件以使用新的头部组件。
+
+ ```python
+ model = dict(
+ head=dict(
+ type='VisionTransformerClsHead',
+ ...,
+ ))
+ ```
+
+### 基于 BaseModule 类
+
+这是一个基于 MMEngine 中的 `BaseModule` 进行开发的例子 `MAEPretrainHead`,主要用于 `MAE` 的掩码学习。我们需要实现 `loss` 函数来计算损失,其它的函数均为可选项。
+
+```python
+# Copyright (c) OpenMMLab. All rights reserved.
+import torch
+from mmengine.model import BaseModule
+
+from mmpretrain.registry import MODELS
+
+
+@MODELS.register_module()
+class MAEPretrainHead(BaseModule):
+ """Head for MAE Pre-training."""
+
+ def __init__(self,
+ loss: dict,
+ norm_pix: bool = False,
+ patch_size: int = 16) -> None:
+ super().__init__()
+ self.norm_pix = norm_pix
+ self.patch_size = patch_size
+ self.loss_module = MODELS.build(loss)
+
+ def patchify(self, imgs: torch.Tensor) -> torch.Tensor:
+ """Split images into non-overlapped patches."""
+ p = self.patch_size
+ assert imgs.shape[2] == imgs.shape[3] and imgs.shape[2] % p == 0
+
+ h = w = imgs.shape[2] // p
+ x = imgs.reshape(shape=(imgs.shape[0], 3, h, p, w, p))
+ x = torch.einsum('nchpwq->nhwpqc', x)
+ x = x.reshape(shape=(imgs.shape[0], h * w, p**2 * 3))
+ return x
+
+ def construct_target(self, target: torch.Tensor) -> torch.Tensor:
+ """Construct the reconstruction target."""
+ target = self.patchify(target)
+ if self.norm_pix:
+ # normalize the target image
+ mean = target.mean(dim=-1, keepdim=True)
+ var = target.var(dim=-1, keepdim=True)
+ target = (target - mean) / (var + 1.e-6)**.5
+
+ return target
+
+ def loss(self, pred: torch.Tensor, target: torch.Tensor,
+ mask: torch.Tensor) -> torch.Tensor:
+ """Generate loss."""
+ target = self.construct_target(target)
+ loss = self.loss_module(pred, target, mask)
+
+ return loss
+```
+
+完成实现后,之后的步骤和 [基于分类头](#基于分类头) 中的步骤 2 和步骤 3 一致。
+
+## 添加新的损失函数
+
+要添加新的损失函数,我们主要需要在损失函数模块中实现 `forward` 函数。这里需要注意的是,损失模块也应该注册到 `MODELS` 中。另外,利用装饰器 `weighted_loss` 可以方便地实现对每个元素的损失进行加权平均。
+
+假设我们要模拟从另一个分类模型生成的概率分布,需要添加 `L1loss` 来实现该目的。
+
+1. 创建一个新文件 `mmpretrain/models/losses/l1_loss.py`
+
+ ```python
+ import torch
+ import torch.nn as nn
+
+ from mmpretrain.registry import MODELS
+ from .utils import weighted_loss
+
+ @weighted_loss
+ def l1_loss(pred, target):
+ assert pred.size() == target.size() and target.numel() > 0
+ loss = torch.abs(pred - target)
+ return loss
+
+ @MODELS.register_module()
+ class L1Loss(nn.Module):
+
+ def __init__(self, reduction='mean', loss_weight=1.0):
+ super(L1Loss, self).__init__()
+ self.reduction = reduction
+ self.loss_weight = loss_weight
+
+ def forward(self,
+ pred,
+ target,
+ weight=None,
+ avg_factor=None,
+ reduction_override=None):
+ assert reduction_override in (None, 'none', 'mean', 'sum')
+ reduction = (
+ reduction_override if reduction_override else self.reduction)
+ loss = self.loss_weight * l1_loss(
+ pred, target, weight, reduction=reduction, avg_factor=avg_factor)
+ return loss
+ ```
+
+2. 在文件 `mmpretrain/models/losses/__init__.py` 中导入这个模块
+
+ ```python
+ ...
+ from .l1_loss import L1Loss
+
+ __all__ = [
+ ..., 'L1Loss'
+ ]
+ ```
+
+3. 修改配置文件中的 `loss` 字段以使用新的损失函数
+
+ ```python
+ model = dict(
+ head=dict(
+ loss=dict(type='L1Loss', loss_weight=1.0),
+ ))
+ ```
+
+最后我们可以在配置文件中结合所有新增的模型组件来使用新的模型。由于 `ResNet_CIFAR` 不是一个基于 ViT 的骨干网络,这里我们不使用 `VisionTransformerClsHead` 的配置。
+
+```python
+model = dict(
+ type='ImageClassifier',
+ backbone=dict(
+ type='ResNet_CIFAR',
+ depth=18,
+ num_stages=4,
+ out_indices=(3, ),
+ style='pytorch'),
+ neck=dict(type='GlobalAveragePooling'),
+ head=dict(
+ type='LinearClsHead',
+ num_classes=10,
+ in_channels=512,
+ loss=dict(type='L1Loss', loss_weight=1.0),
+ topk=(1, 5),
+ ))
+
+```
+
+```{tip}
+为了方便,相同的模型组件可以直接从已有的config文件里继承,更多细节可以参考[学习配置文件](../user_guides/config.md)。
+```
diff --git a/docs/zh_CN/advanced_guides/pipeline.md b/docs/zh_CN/advanced_guides/pipeline.md
new file mode 100644
index 0000000000000000000000000000000000000000..99506b0848008befab6781771071cbb54cf2bfb0
--- /dev/null
+++ b/docs/zh_CN/advanced_guides/pipeline.md
@@ -0,0 +1,148 @@
+# 自定义数据处理流程
+
+## 数据流的设计
+
+在[新数据集教程](./datasets.md)中,我们知道数据集类使用 `load_data_list` 方法来初始化整个数据集,我们将每个样本的信息保存到一个 dict 中。
+
+通常,为了节省内存,我们在 `load_data_list` 中只加载图片路径和标签,在实际使用时才加载完整的图片内容。此外,我们可能希望在训练中读取样本时进行一些随机数据增强。
+
+数据管道意味着在从数据集中索引样本时如何处理样本字典,它由一系列数据变换组成。每个数据变换都将一个字典作为输入,对其进行处理,并为下一个数据变换输出一个字典。
+
+这是 ImageNet 上 ResNet-50 训练的数据管道示例。
+
+```python
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ dict(type='RandomResizedCrop', scale=224),
+ dict(type='RandomFlip', prob=0.5, direction='horizontal'),
+ dict(type='PackInputs'),
+]
+```
+
+MMPretrain 中所有可用的数据变换都可以在 [数据变换文档](mmpretrain.datasets.transforms) 中找到。
+
+## 修改训练/测试管道
+
+MMPretrain 中的数据管道非常灵活。您几乎可以从配置文件中控制数据预处理的每一步,但另一方面,面对如此多的选项,您可能会感到困惑。
+
+这是图像分类任务的常见做法和指南。
+
+### 读取
+
+在数据管道的开始,我们通常需要从文件路径加载图像数据。
+[`LoadImageFromFile`](mmcv.transforms.LoadImageFromFile) 通常用于执行此任务。
+
+```python
+train_pipeline = [
+ dict(type='LoadImageFromFile'),
+ ...
+]
+```
+
+如果您想从具有特殊格式或特殊位置的文件中加载数据,您可以[实现新的加载变换](#添加新的数据变换)并将其添加到数据管道的开头。
+
+### 增强和其它处理
+
+在训练过程中,我们通常需要做数据增强来避免过拟合。在测试过程中,我们还需要做一些数据处理,比如调整大小和裁剪。这些数据变换将放置在加载过程之后。
+
+这是一个简单的数据扩充方案示例。它会将输入图像随机调整大小并裁剪到指定比例,并随机水平翻转图像。
+
+```python
+train_pipeline = [
+ ...
+ dict(type='RandomResizedCrop', scale=224),
+ dict(type='RandomFlip', prob=0.5, direction='horizontal'),
+ ...
+]
+```
+
+下面是 [Swin-Transformer](../papers/swin_transformer.md) 训练中使用的较为复杂的数据增强配置示例。为了与官方实现保持一致,它指定 `pillow` 作为图像缩放后端,`bicubic` 作为插值算法。此外,它添加了 [`RandAugment`](mmpretrain.datasets.transforms.RandAugment) 和 [`RandomErasing`](mmpretrain.datasets.transforms.RandomErasing) 作为额外的数据增强方法。
+
+此配置指定了数据增强的每个细节,您只需将其复制到您自己的配置文件中,即可应用 Swin-Transformer 的数据增强策略。
+
+```python
+bgr_mean = [103.53, 116.28, 123.675]
+bgr_std = [57.375, 57.12, 58.395]
+
+train_pipeline = [
+ ...
+ dict(type='RandomResizedCrop', scale=224, backend='pillow', interpolation='bicubic'),
+ dict(type='RandomFlip', prob=0.5, direction='horizontal'),
+ dict(
+ type='RandAugment',
+ policies='timm_increasing',
+ num_policies=2,
+ total_level=10,
+ magnitude_level=9,
+ magnitude_std=0.5,
+ hparams=dict(
+ pad_val=[round(x) for x in bgr_mean], interpolation='bicubic')),
+ dict(
+ type='RandomErasing',
+ erase_prob=0.25,
+ mode='rand',
+ min_area_ratio=0.02,
+ max_area_ratio=1 / 3,
+ fill_color=bgr_mean,
+ fill_std=bgr_std),
+ ...
+]
+```
+
+```{note}
+通常,数据管道中的数据增强部分仅处理图像方面的变换,而不处理图像归一化或混合/剪切混合等变换。 因为我们可以对 batch data 做 image normalization 和 mixup/cutmix 来加速。要配置图像归一化和 mixup/cutmix,请使用 [数据预处理器](mmpretrain.models.utils.data_preprocessor)。
+```
+
+### 格式化
+
+格式化是从数据信息字典中收集训练数据,并将这些数据转换为模型友好的格式。
+
+在大多数情况下,您可以简单地使用 [`PackInputs`](mmpretrain.datasets.transforms.PackInputs),它将 NumPy 数组格式的图像转换为 PyTorch 张量,并将 ground truth 类别信息和其他元信息打包为 [`DataSample`](mmpretrain.structures.DataSample)。
+
+```python
+train_pipeline = [
+ ...
+ dict(type='PackInputs'),
+]
+```
+
+## 添加新的数据变换
+
+1. 在任何文件中写入一个新的数据变换,例如 `my_transform.py`,并将其放在文件夹 `mmpretrain/datasets/transforms/` 中。数据变换类需要继承 [`mmcv.transforms.BaseTransform`](mmcv.transforms.BaseTransform) 类,并重写以字典作为输入、以字典作为输出的 `transform` 方法(本小节末尾给出了一个更具体的示意实现)。
+
+ ```python
+ from mmcv.transforms import BaseTransform
+ from mmpretrain.registry import TRANSFORMS
+
+ @TRANSFORMS.register_module()
+ class MyTransform(BaseTransform):
+
+ def transform(self, results):
+ # Modify the data information dict `results`.
+ return results
+ ```
+
+2. 在 `mmpretrain/datasets/transforms/__init__.py` 中导入新的变换
+
+ ```python
+ ...
+ from .my_transform import MyTransform
+
+ __all__ = [
+ ..., 'MyTransform'
+ ]
+ ```
+
+3. 在配置文件中使用
+
+ ```python
+ train_pipeline = [
+ ...
+ dict(type='MyTransform'),
+ ...
+ ]
+ ```
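+
+作为补充,下面给出一个更具体的示意实现:一个给图像添加高斯噪声的数据变换。这里假设 `results['img']` 是由 `LoadImageFromFile` 读取得到的 numpy 数组,变换名称 `AddGaussianNoise` 仅为示例:
+
+```python
+import numpy as np
+from mmcv.transforms import BaseTransform
+
+from mmpretrain.registry import TRANSFORMS
+
+
+@TRANSFORMS.register_module()
+class AddGaussianNoise(BaseTransform):
+    """给图像添加高斯噪声(仅作示意)。"""
+
+    def __init__(self, sigma=0.1):
+        self.sigma = sigma
+
+    def transform(self, results):
+        # 假设 'img' 字段为 uint8 的 numpy 数组
+        img = results['img'].astype(np.float32)
+        noise = np.random.normal(0., self.sigma * 255, img.shape)
+        results['img'] = np.clip(img + noise, 0, 255).astype(np.uint8)
+        return results
+```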
+
+## 数据管道可视化
+
+数据流水线设计完成后,可以使用 [可视化工具](../useful_tools/dataset_visualization.md) 查看效果。
diff --git a/docs/zh_CN/advanced_guides/runtime.md b/docs/zh_CN/advanced_guides/runtime.md
new file mode 100644
index 0000000000000000000000000000000000000000..e5fa3864a47a9ebb77ab992cc45b1162814f52fb
--- /dev/null
+++ b/docs/zh_CN/advanced_guides/runtime.md
@@ -0,0 +1,213 @@
+# 自定义运行参数
+
+运行参数配置包括许多有用的功能,如权重文件保存、日志配置等等,在本教程中,我们将介绍如何配置这些功能。
+
+## 保存权重文件
+
+权重文件保存功能是一个在训练阶段默认注册的钩子, 你可以通过配置文件中的 `default_hooks.checkpoint` 字段配置它。
+
+```{note}
+钩子机制在 OpenMMLab 开源算法库中应用非常广泛。通过钩子,你可以在不修改运行器的主要执行逻辑的情况下插入许多功能。
+
+可以通过{external+mmengine:doc}`相关文章 `进一步理解钩子。
+```
+
+**默认配置:**
+
+```python
+default_hooks = dict(
+ ...
+ checkpoint = dict(type='CheckpointHook', interval=1)
+ ...
+)
+```
+
+下面是一些[权重文件钩子(CheckpointHook)](mmengine.hooks.CheckpointHook)的常用可配置参数。
+
+- **`interval`** (int): 文件保存周期。如果使用-1,它将永远不会保存权重。
+- **`by_epoch`** (bool): 选择 **`interval`** 是基于epoch还是基于iteration, 默认为 `True`.
+- **`out_dir`** (str): 保存权重文件的根目录。如果不指定,检查点将被保存在工作目录中。如果指定,检查点将被保存在 **`out_dir`** 的子文件夹中。
+- **`max_keep_ckpts`** (int): 要保留的权重文件数量。在某些情况下,为了节省磁盘空间,我们希望只保留最近的几个权重文件。默认为 -1,也就是无限制。
+- **`save_best`** (str, List[str]): 如果指定,它将保存具有最佳评估结果的权重。
+ 通常情况下,你可以直接使用`save_best="auto"`来自动选择评估指标。
+
+而如果你想要更高级的配置,请参考[权重文件钩子(CheckpointHook)](tutorials/hook.md#checkpointhook)。
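+
+例如,下面的配置(仅作示意)表示每个 epoch 保存一次权重,只保留最近的 3 个权重文件,并额外保存评估结果最好的权重:
+
+```python
+default_hooks = dict(
+    checkpoint=dict(
+        type='CheckpointHook',
+        interval=1,          # 每个 epoch 保存一次
+        max_keep_ckpts=3,    # 只保留最近的 3 个权重文件
+        save_best='auto',    # 自动根据评估指标保存最优权重
+    ),
+)
+```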
+
+## 权重加载 / 断点训练
+
+在配置文件中,你可以加载指定模型权重或者断点继续训练,如下所示:
+
+```python
+# 从指定权重文件加载
+load_from = "Your checkpoint path"
+
+# 是否从加载的断点继续训练
+resume = False
+```
+
+`load_from` 字段可以是本地路径,也可以是HTTP路径。你可以从检查点恢复训练,方法是指定 `resume=True`。
+
+```{tip}
+你也可以通过指定 `load_from=None` 和 `resume=True` 启用从最新的断点自动恢复。
+Runner执行器将自动从工作目录中找到最新的权重文件。
+```
+
+如果你用我们的 `tools/train.py` 脚本来训练模型,你只需使用 `--resume` 参数来恢复训练,就不用手动修改配置文件了。如下所示:
+
+```bash
+# 自动从最新的断点恢复
+python tools/train.py configs/resnet/resnet50_8xb32_in1k.py --resume
+
+# 从指定的断点恢复
+python tools/train.py configs/resnet/resnet50_8xb32_in1k.py --resume checkpoints/resnet.pth
+```
+
+## 随机性(Randomness)配置
+
+为了让实验尽可能是可复现的, 我们在 `randomness` 字段中提供了一些控制随机性的选项。
+
+默认情况下,我们不会在配置文件中指定随机数种子,在每次实验中,程序会生成一个不同的随机数种子。
+
+**默认配置:**
+
+```python
+randomness = dict(seed=None, deterministic=False)
+```
+
+为了使实验更具可复现性,你可以指定一个种子并设置 `deterministic=True`。
+`deterministic` 选项的使用效果可以在[这里](https://pytorch.org/docs/stable/notes/randomness.html#cuda-convolution-benchmarking)找到。
+
+## 日志配置
+
+日志的配置与多个字段有关。
+
+在`log_level`字段中,你可以指定全局日志级别。参见 {external+python:ref}`Logging Levels` 以获得日志级别列表。
+
+```python
+log_level = 'INFO'
+```
+
+在 `default_hooks.logger` 字段中,你可以指定训练和测试期间的日志间隔。
+而所有可用的参数可以在[日志钩子文档](tutorials/hook.md#loggerhook)中找到。
+
+```python
+default_hooks = dict(
+ ...
+ # 每100次迭代就打印一次日志
+ logger=dict(type='LoggerHook', interval=100),
+ ...
+)
+```
+
+在 `log_processor` 字段中,你可以指定日志信息的平滑方法。
+通常,我们使用一个长度为10的窗口来平滑日志中的值,并输出所有信息的平均值。
+如果你想特别指定某些信息的平滑方法,请参阅{external+mmengine:doc}`日志处理器文档 `。
+
+```python
+# 默认设置,它将通过一个10长度的窗口平滑训练日志中的值
+log_processor = dict(window_size=10)
+```
+
+在 `visualizer` 字段中,你可以指定多个后端来保存日志信息,如TensorBoard和WandB。
+更多的细节可以在[可视化工具](#visualizer)找到。
+
+## 自定义钩子
+
+上述许多功能是由钩子实现的,你也可以通过修改 `custom_hooks` 字段来插入其他的自定义钩子。
+下面是 MMEngine 和 MMPretrain 中的一些钩子,你可以直接使用,例如:
+
+- [EMAHook](mmpretrain.engine.hooks.EMAHook)
+- [SyncBuffersHook](mmengine.hooks.SyncBuffersHook)
+- [EmptyCacheHook](mmengine.hooks.EmptyCacheHook)
+- [ClassNumCheckHook](mmpretrain.engine.hooks.ClassNumCheckHook)
+- ......
+
+例如,EMA(Exponential Moving Average)在模型训练中被广泛使用,你可以通过以下方式启用它:
+
+```python
+custom_hooks = [
+ dict(type='EMAHook', momentum=4e-5, priority='ABOVE_NORMAL'),
+]
+```
+
+## 验证可视化
+
+验证可视化钩子是一个验证过程中默认注册的钩子。
+你可以在 `default_hooks.visualization` 字段中来配置它。
+
+默认情况下,我们禁用这个钩子,你可以通过指定 `enable=True` 来启用它。而更多的参数可以在
+[可视化钩子文档](mmpretrain.engine.hooks.VisualizationHook)中找到。
+
+```python
+default_hooks = dict(
+ ...
+ visualization=dict(type='VisualizationHook', enable=False),
+ ...
+)
+```
+
+这个钩子将在验证数据集中选择一部分图像,在每次验证过程中记录并可视化它们的预测结果。
+你可以用它来观察训练期间模型在实际图像上的性能变化。
+
+此外,如果你的验证数据集中的图像很小(\<100,如 CIFAR 数据集),
+你可以指定 `rescale_factor` 来缩放它们,例如 `rescale_factor=2.` 会将可视化的图像放大两倍。
+
+## Visualizer
+
+`Visualizer` 用于记录训练和测试过程中的各种信息,包括日志、图像和标量。
+默认情况下,记录的信息将被保存在工作目录下的 `vis_data` 文件夹中。
+
+**默认配置:**
+
+```python
+visualizer = dict(
+ type='UniversalVisualizer',
+ vis_backends=[
+ dict(type='LocalVisBackend'),
+ ]
+)
+```
+
+通常,最有用的功能是将日志和标量如 `loss` 保存到不同的后端。
+例如,要把它们保存到 TensorBoard,只需像下面这样设置:
+
+```python
+visualizer = dict(
+ type='UniversalVisualizer',
+ vis_backends=[
+ dict(type='LocalVisBackend'),
+ dict(type='TensorboardVisBackend'),
+ ]
+)
+```
+
+或者像下面这样把它们保存到 WandB:
+
+```python
+visualizer = dict(
+ type='UniversalVisualizer',
+ vis_backends=[
+ dict(type='LocalVisBackend'),
+ dict(type='WandbVisBackend'),
+ ]
+)
+```
+
+## 环境配置
+
+在 `env_cfg` 字段中,你可以配置一些底层的参数,如 cuDNN、多进程和分布式通信。
+
+**在修改这些参数之前,请确保你理解这些参数的含义。**
+
+```python
+env_cfg = dict(
+ # 是否启用cudnn基准测试
+ cudnn_benchmark=False,
+
+ # 设置多进程参数
+ mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
+
+ # 设置分布式参数
+ dist_cfg=dict(backend='nccl'),
+)
+```
diff --git a/docs/zh_CN/advanced_guides/schedule.md b/docs/zh_CN/advanced_guides/schedule.md
new file mode 100644
index 0000000000000000000000000000000000000000..d1c347d11930acd7087701ae2db0e750a9012ef2
--- /dev/null
+++ b/docs/zh_CN/advanced_guides/schedule.md
@@ -0,0 +1,359 @@
+# 自定义训练优化策略
+
+在我们的算法库中,已经提供了通用数据集(如ImageNet,CIFAR)的[默认训练策略配置](https://github.com/open-mmlab/mmpretrain/blob/main/configs/_base_/schedules)。如果想要在这些数据集上继续提升模型性能,或者在不同数据集和方法上进行新的尝试,我们通常需要修改这些默认的策略。
+
+在本教程中,我们将介绍在运行自定义训练时,如何通过修改配置文件来构造优化器、进行参数精细化配置、梯度裁剪、梯度累计以及定制动量调整策略等。同时也会通过模板简单介绍如何自定义开发优化器和优化器构造器。
+
+## 配置训练优化策略
+
+我们通过 `optim_wrapper` 来配置主要的优化策略,包括优化器的选择,混合精度训练的选择,参数化精细配置,梯度裁剪以及梯度累计。接下来将分别介绍这些内容。
+
+### 构造 PyTorch 内置优化器
+
+MMPretrain 支持 PyTorch 实现的所有优化器,仅需在配置文件中,指定优化器封装需要的 `optimizer` 字段。
+
+如果要使用 [`SGD`](torch.optim.SGD),则修改如下。这里要注意所有优化相关的配置都需要封装在 `optim_wrapper` 配置里。
+
+```python
+optim_wrapper = dict(
+ type='OptimWrapper',
+ optimizer=dict(type='SGD', lr=0.0003, weight_decay=0.0001)
+)
+```
+
+```{note}
+配置文件中的 'type' 不是构造时的参数,而是 PyTorch 内置优化器的类名。
+更多优化器选择可以参考{external+torch:ref}`PyTorch 支持的优化器列表`。
+```
+
+要修改模型的学习率,只需要在优化器的配置中修改 `lr` 即可。
+要配置其他参数,可直接根据 [PyTorch API 文档](torch.optim) 进行。
+
+例如,如果想使用 [`Adam`](torch.optim.Adam) 并设置参数为 `torch.optim.Adam(params, lr=0.001, betas=(0.9, 0.999), eps=1e-08, weight_decay=0, amsgrad=False)`。
+则需要进行如下修改:
+
+```python
+optim_wrapper = dict(
+ type='OptimWrapper',
+ optimizer = dict(
+ type='Adam',
+ lr=0.001,
+ betas=(0.9, 0.999),
+ eps=1e-08,
+ weight_decay=0,
+ amsgrad=False),
+)
+```
+
+````{note}
+考虑到对于单精度训练来说,优化器封装的默认类型就是 `OptimWrapper`,我们在这里可以直接省略,因此配置文件可以进一步简化为:
+
+```python
+optim_wrapper = dict(
+ optimizer=dict(
+ type='Adam',
+ lr=0.001,
+ betas=(0.9, 0.999),
+ eps=1e-08,
+ weight_decay=0,
+ amsgrad=False))
+```
+````
+
+### 混合精度训练
+
+如果我们想要使用混合精度训练(Automatic Mixed Precision),只需简单地将 `optim_wrapper` 的类型改为 `AmpOptimWrapper`。
+
+```python
+optim_wrapper = dict(type='AmpOptimWrapper', optimizer=...)
+```
+
+另外,为了方便,我们同时在启动训练脚本 `tools/train.py` 中提供了 `--amp` 参数作为开启混合精度训练的开关,更多细节可以参考[训练教程](../user_guides/train.md)。
+
+### 参数化精细配置
+
+在一些模型中,不同的优化策略需要适应特定的参数,例如不在 BatchNorm 层使用权重衰减,或者在不同层使用不同的学习率等等。
+我们需要用到 `optim_wrapper` 中的 `paramwise_cfg` 参数来进行精细化配置。
+
+- **为不同类型的参数设置超参乘子**
+
+ 例如,我们可以在 `paramwise_cfg` 配置中设置 `norm_decay_mult=0.` 来改变归一化层权重和偏移的衰减为0。
+
+ ```python
+ optim_wrapper = dict(
+ optimizer=dict(type='SGD', lr=0.8, weight_decay=1e-4),
+ paramwise_cfg=dict(norm_decay_mult=0.))
+ ```
+
+ 支持更多类型的参数配置,参考以下列表:
+
+ - `bias_lr_mult`:偏置的学习率系数(不包括正则化层的偏置以及可变形卷积的 offset),默认值为 1
+ - `bias_decay_mult`:偏置的权值衰减系数(不包括正则化层的偏置以及可变形卷积的 offset),默认值为 1
+ - `norm_decay_mult`:正则化层权重和偏置的权值衰减系数,默认值为 1
+ - `flat_decay_mult`: 一维参数的权值衰减系数,默认值为 1
+ - `dwconv_decay_mult`:Depth-wise 卷积的权值衰减系数,默认值为 1
+ - `bypass_duplicate`:是否跳过重复的参数,默认为 `False`
+ - `dcn_offset_lr_mult`:可变形卷积(Deformable Convolution)的学习率系数,默认值为 1
+
+- **为特定参数设置超参乘子**
+
+ MMPretrain 通过 `paramwise_cfg` 的 `custom_keys` 参数来配置特定参数的超参乘子。
+
+ 例如,我们可以通过以下配置来设置所有 `backbone.layer0` 层的学习率和权重衰减为0, `backbone` 的其余层和优化器保持一致,另外 `head` 层的学习率为0.001.
+
+ ```python
+ optim_wrapper = dict(
+ optimizer=dict(type='SGD', lr=0.01, weight_decay=0.0001),
+ paramwise_cfg=dict(
+ custom_keys={
+ 'backbone.layer0': dict(lr_mult=0, decay_mult=0),
+ 'backbone': dict(lr_mult=1),
+ 'head': dict(lr_mult=0.1)
+ }))
+ ```
+
+### 梯度裁剪
+
+在训练过程中,损失函数可能接近于一些异常陡峭的区域,从而导致梯度爆炸。而梯度裁剪可以帮助稳定训练过程,更多介绍可以参见[该页面](https://paperswithcode.com/method/gradient-clipping)。
+
+目前我们支持在 `optim_wrapper` 字段中添加 `clip_grad` 参数来进行梯度裁剪,更详细的参数可参考 [PyTorch 文档](torch.nn.utils.clip_grad_norm_)。
+
+用例如下:
+
+```python
+optim_wrapper = dict(
+ optimizer=dict(type='SGD', lr=0.01, weight_decay=0.0001),
+ # norm_type: 使用的范数类型,此处使用范数2。
+ clip_grad=dict(max_norm=35, norm_type=2))
+```
+
+### 梯度累计
+
+计算资源缺乏时,每个训练批次的大小(batch size)只能设置为较小的值,这可能会影响模型的性能。
+
+可以使用梯度累计来规避这一问题。我们支持在 `optim_wrapper` 字段中添加 `accumulative_counts` 参数来进行梯度累计。
+
+用例如下:
+
+```python
+train_dataloader = dict(batch_size=64)
+optim_wrapper = dict(
+ optimizer=dict(type='SGD', lr=0.01, weight_decay=0.0001),
+ accumulative_counts=4)
+```
+
+表示训练时,每累计 4 个 iter 的梯度才执行一次参数更新。由于此时单张 GPU 上的批次大小为 64,也就等价于单张 GPU 上一次参数更新的等效批次大小为 256,也即:
+
+```python
+train_dataloader = dict(batch_size=256)
+optim_wrapper = dict(
+ optimizer=dict(type='SGD', lr=0.01, weight_decay=0.0001))
+```
+
+## 配置参数优化策略
+
+在训练过程中,优化参数例如学习率、动量,通常不会是固定不变,而是随着训练进程的变化而调整。PyTorch 支持一些学习率调整的调度器,但是不足以完成复杂的策略。在 MMPretrain 中,我们提供 `param_scheduler` 来更好地控制不同优化参数的策略。
+
+### 配置学习率调整策略
+
+深度学习研究中,广泛应用学习率衰减来提高网络的性能。我们支持大多数 PyTorch 学习率调度器, 其中包括 `ExponentialLR`, `LinearLR`, `StepLR`, `MultiStepLR` 等等。
+
+- **单个学习率策略**
+
+ 多数情况下,我们使用单一学习率策略,这里 `param_scheduler` 会是一个字典。比如在默认的 ResNet 网络训练中,我们使用阶梯式的学习率衰减策略 [`MultiStepLR`](mmengine.optim.MultiStepLR),配置文件为:
+
+ ```python
+ param_scheduler = dict(
+ type='MultiStepLR',
+ by_epoch=True,
+ milestones=[100, 150],
+ gamma=0.1)
+ ```
+
+ 或者我们想使用 [`CosineAnnealingLR`](mmengine.optim.CosineAnnealingLR) 来进行学习率衰减:
+
+ ```python
+ param_scheduler = dict(
+ type='CosineAnnealingLR',
+ by_epoch=True,
+ T_max=num_epochs)
+ ```
+
+- **多个学习率策略**
+
+ 然而在一些其他情况下,为了提高模型的精度,通常会使用多种学习率策略。例如,在训练的早期阶段,网络容易不稳定,而学习率的预热就是为了减少这种不稳定性。
+
+ 整个学习过程中,学习率将会通过预热从一个很小的值逐步提高到预定值,再会通过其他的策略进一步调整。
+
+ 在 MMPretrain 中,我们同样使用 `param_scheduler` ,将多种学习策略写成列表就可以完成上述预热策略的组合。
+
+ 例如:
+
+ 1. 在前50次迭代中逐**迭代次数**地**线性**预热
+
+ ```python
+ param_scheduler = [
+ # 逐迭代次数,线性预热
+ dict(type='LinearLR',
+ start_factor=0.001,
+ by_epoch=False, # 逐迭代次数
+ end=50), # 只预热50次迭代次数
+ # 主要的学习率策略
+ dict(type='MultiStepLR',
+ by_epoch=True,
+ milestones=[8, 11],
+ gamma=0.1)
+ ]
+ ```
+
+ 2. 在前10轮迭代中逐**迭代次数**地**线性**预热
+
+ ```python
+ param_scheduler = [
+ # 在前10轮迭代中,逐迭代次数,线性预热
+ dict(type='LinearLR',
+ start_factor=0.001,
+ by_epoch=True,
+ end=10,
+ convert_to_iter_based=True, # 逐迭代次数更新学习率.
+ ),
+ # 在 10 轮次后,通过余弦退火衰减
+ dict(type='CosineAnnealingLR', by_epoch=True, begin=10)
+ ]
+ ```
+
+ 注意这里增加了 `begin` 和 `end` 参数,这两个参数指定了调度器的**生效区间**。生效区间通常只在多个调度器组合时才需要去设置,使用单个调度器时可以忽略。当指定了 `begin` 和 `end` 参数时,表示该调度器只在 [begin, end) 区间内生效,其单位是由 `by_epoch` 参数决定。在组合不同调度器时,各调度器的 `by_epoch` 参数不必相同。如果没有指定的情况下,`begin` 为 0, `end` 为最大迭代轮次或者最大迭代次数。
+
+ 如果相邻两个调度器的生效区间没有紧邻,而是有一段区间没有被覆盖,那么这段区间的学习率维持不变。而如果两个调度器的生效区间发生了重叠,则对多组调度器叠加使用,学习率的调整会按照调度器配置文件中的顺序触发(行为与 PyTorch 中 [`ChainedScheduler`](torch.optim.lr_scheduler.ChainedScheduler) 一致)。
+
+ ```{tip}
+ 为了避免学习率曲线与预期不符, 配置完成后,可以使用 MMPretrain 提供的 [学习率可视化工具](../useful_tools/scheduler_visualization.md) 画出对应学习率调整曲线。
+ ```
+
+### 配置动量调整策略
+
+MMPretrain 支持动量调度器根据学习率修改优化器的动量,从而使损失函数收敛更快。用法和学习率调度器一致。
+
+我们支持的动量策略和详细的使用细节可以参考[这里](https://github.com/open-mmlab/mmengine/blob/main/mmengine/optim/scheduler/momentum_scheduler.py)。动量调度器只是将调度器名称中的 `LR` 替换为了 `Momentum`,动量策略可以直接追加到 `param_scheduler` 列表中。
+
+这里是一个用例:
+
+```python
+param_scheduler = [
+ # 学习率策略
+ dict(type='LinearLR', ...),
+ # 动量策略
+ dict(type='LinearMomentum',
+ start_factor=0.001,
+ by_epoch=False,
+ begin=0,
+ end=1000)
+]
+```
+
+## 新增优化器或者优化器构造器
+
+```{note}
+本部分将修改 MMPretrain 源码或者向 MMPretrain 框架添加代码,初学者可跳过。
+```
+
+### 新增优化器
+
+在学术研究和工业实践中,可能需要使用 MMPretrain 未实现的优化方法,可以通过以下方法添加。
+
+1. 定义一个新的优化器
+
+ 一个自定义的优化器可根据如下规则进行定制:
+
+   假设我们想添加一个名为 `MyOptimizer` 的优化器,其拥有参数 `a`, `b` 和 `c`。
+ 可以创建一个名为 `mmpretrain/engine/optimizer` 的文件夹,并在目录下的一个文件,如 `mmpretrain/engine/optimizer/my_optimizer.py` 中实现该自定义优化器:
+
+ ```python
+ from mmpretrain.registry import OPTIMIZERS
+ from torch.optim import Optimizer
+
+
+ @OPTIMIZERS.register_module()
+ class MyOptimizer(Optimizer):
+
+ def __init__(self, a, b, c):
+ ...
+
+ def step(self, closure=None):
+ ...
+ ```
+
+2. 注册优化器
+
+   要注册上面定义的模块,需要将此模块导入到主命名空间中,下面介绍最常用的方法。
+
+ 修改 `mmpretrain/engine/optimizers/__init__.py`,将其导入至 `mmpretrain.engine` 包。
+
+ ```python
+ # 在 mmpretrain/engine/optimizers/__init__.py 中
+ ...
+ from .my_optimizer import MyOptimizer # MyOptimizer 是我们自定义的优化器的名字
+
+ __all__ = [..., 'MyOptimizer']
+ ```
+
+ 在运行过程中,我们会自动导入 `mmpretrain.engine` 包并同时注册 `MyOptimizer`。
+
+3. 在配置文件中指定优化器
+
+ 之后,用户便可在配置文件的 `optim_wrapper.optimizer` 域中使用 `MyOptimizer`:
+
+ ```python
+ optim_wrapper = dict(
+ optimizer=dict(type='MyOptimizer', a=a_value, b=b_value, c=c_value))
+ ```
+
+### 新增优化器构造器
+
+某些模型可能具有一些特定于参数的设置以进行优化,例如为所有 BatchNorm 层设置不同的权重衰减。
+
+尽管我们已经可以使用 [`optim_wrapper.paramwise_cfg` 字段](#参数化精细配置)来配置特定参数的优化设置,但可能仍然无法覆盖你的需求。
+
+当然你可以在此基础上进行修改。我们默认使用 [`DefaultOptimWrapperConstructor`](mmengine.optim.DefaultOptimWrapperConstructor) 来构造优化器。在构造过程中,通过 `paramwise_cfg` 来精细化配置不同设置。这个默认构造器可以作为新优化器构造器实现的模板。
+
+我们可以新增一个优化器构造器来覆盖这些行为。
+
+```python
+# 在 mmpretrain/engine/optimizers/my_optim_constructor.py 中
+from mmengine.optim import DefaultOptimWrapperConstructor
+from mmpretrain.registry import OPTIM_WRAPPER_CONSTRUCTORS
+
+
+@OPTIM_WRAPPER_CONSTRUCTORS.register_module()
+class MyOptimWrapperConstructor:
+
+ def __init__(self, optim_wrapper_cfg, paramwise_cfg=None):
+ ...
+
+ def __call__(self, model):
+ ...
+```
+
+这是一个已实现的 [OptimWrapperConstructor](mmpretrain.engine.optimizers.LearningRateDecayOptimWrapperConstructor) 具体例子。
+
+接下来类似 [新增优化器教程](#新增优化器) 来导入并使用新的优化器构造器。
+
+1. 修改 `mmpretrain/engine/optimizers/__init__.py`,将其导入至 `mmpretrain.engine` 包。
+
+ ```python
+ # 在 mmpretrain/engine/optimizers/__init__.py 中
+ ...
+ from .my_optim_constructor import MyOptimWrapperConstructor
+
+ __all__ = [..., 'MyOptimWrapperConstructor']
+ ```
+
+2. 在配置文件的 `optim_wrapper.constructor` 字段中使用 `MyOptimWrapperConstructor` 。
+
+ ```python
+ optim_wrapper = dict(
+ constructor=dict(type='MyOptimWrapperConstructor'),
+ optimizer=...,
+ paramwise_cfg=...,
+ )
+ ```
diff --git a/docs/zh_CN/api b/docs/zh_CN/api
new file mode 120000
index 0000000000000000000000000000000000000000..0ef434a4902196a4b89383d9cfb5f47b2e11a999
--- /dev/null
+++ b/docs/zh_CN/api
@@ -0,0 +1 @@
+../en/api
\ No newline at end of file
diff --git a/docs/zh_CN/conf.py b/docs/zh_CN/conf.py
new file mode 100644
index 0000000000000000000000000000000000000000..2c372a8ae590fc889bf775bd041aed2991c15fbb
--- /dev/null
+++ b/docs/zh_CN/conf.py
@@ -0,0 +1,253 @@
+# flake8: noqa
+# Configuration file for the Sphinx documentation builder.
+#
+# This file only contains a selection of the most common options. For a full
+# list see the documentation:
+# https://www.sphinx-doc.org/en/master/usage/configuration.html
+
+# -- Path setup --------------------------------------------------------------
+
+# If extensions (or modules to document with autodoc) are in another directory,
+# add these directories to sys.path here. If the directory is relative to the
+# documentation root, use os.path.abspath to make it absolute, like shown here.
+#
+import os
+import subprocess
+import sys
+
+import pytorch_sphinx_theme
+from sphinx.builders.html import StandaloneHTMLBuilder
+
+sys.path.insert(0, os.path.abspath('../../'))
+
+# -- Project information -----------------------------------------------------
+
+project = 'MMPretrain'
+copyright = '2020, OpenMMLab'
+author = 'MMPretrain Authors'
+
+# The full version, including alpha/beta/rc tags
+version_file = '../../mmpretrain/version.py'
+
+
+def get_version():
+ with open(version_file, 'r') as f:
+ exec(compile(f.read(), version_file, 'exec'))
+ return locals()['__version__']
+
+
+release = get_version()
+
+# -- General configuration ---------------------------------------------------
+
+# Add any Sphinx extension module names here, as strings. They can be
+# extensions coming with Sphinx (named 'sphinx.ext.*') or your custom
+# ones.
+extensions = [
+ 'sphinx.ext.autodoc',
+ 'sphinx.ext.autosummary',
+ 'sphinx.ext.intersphinx',
+ 'sphinx.ext.napoleon',
+ 'sphinx.ext.viewcode',
+ 'myst_parser',
+ 'sphinx_copybutton',
+ 'sphinx_tabs.tabs',
+ 'notfound.extension',
+ 'sphinxcontrib.jquery',
+]
+
+# Add any paths that contain templates here, relative to this directory.
+templates_path = ['_templates']
+
+# The suffix(es) of source filenames.
+# You can specify multiple suffix as a list of string:
+#
+source_suffix = {
+ '.rst': 'restructuredtext',
+ '.md': 'markdown',
+}
+
+language = 'zh_CN'
+
+# The master toctree document.
+root_doc = 'index'
+
+# List of patterns, relative to source directory, that match files and
+# directories to ignore when looking for source files.
+# This pattern also affects html_static_path and html_extra_path.
+exclude_patterns = ['_build', 'Thumbs.db', '.DS_Store']
+
+# -- Options for HTML output -------------------------------------------------
+
+# The theme to use for HTML and HTML Help pages. See the documentation for
+# a list of builtin themes.
+#
+html_theme = 'pytorch_sphinx_theme'
+html_theme_path = [pytorch_sphinx_theme.get_html_theme_path()]
+
+# Theme options are theme-specific and customize the look and feel of a theme
+# further. For a list of options available for each theme, see the
+# documentation.
+# yapf: disable
+html_theme_options = {
+ 'menu': [
+ {
+ 'name': 'GitHub',
+ 'url': 'https://github.com/open-mmlab/mmpretrain'
+ },
+ {
+ 'name': 'Colab 教程',
+ 'children': [
+ {'name': '用命令行工具训练和推理',
+ 'url': 'https://colab.research.google.com/github/mzr1996/mmpretrain-tutorial/blob/master/1.x/MMPretrain_tools.ipynb'},
+ {'name': '用 Python API 训练和推理',
+ 'url': 'https://colab.research.google.com/github/mzr1996/mmpretrain-tutorial/blob/master/1.x/MMPretrain_python.ipynb'},
+ ]
+ },
+ {
+ 'name': 'Version',
+ 'children': [
+ {'name': 'MMPretrain 0.x',
+ 'url': 'https://mmpretrain.readthedocs.io/zh_CN/0.x/',
+ 'description': '0.x branch'},
+ {'name': 'MMPretrain 1.x',
+ 'url': 'https://mmpretrain.readthedocs.io/zh_CN/latest/',
+ 'description': 'Main branch'},
+ ],
+ }
+ ],
+ # Specify the language of shared menu
+ 'menu_lang': 'cn',
+ # Disable the default edit on GitHub
+ 'default_edit_on_github': False,
+}
+# yapf: enable
+
+# Add any paths that contain custom static files (such as style sheets) here,
+# relative to this directory. They are copied after the builtin static files,
+# so a file named "default.css" will overwrite the builtin "default.css".
+html_static_path = ['_static']
+html_css_files = [
+ 'https://cdn.datatables.net/v/bs4/dt-1.12.1/datatables.min.css',
+ 'css/readthedocs.css'
+]
+html_js_files = [
+ 'https://cdn.datatables.net/v/bs4/dt-1.12.1/datatables.min.js',
+ 'js/custom.js'
+]
+
+# -- Options for HTMLHelp output ---------------------------------------------
+
+# Output file base name for HTML help builder.
+htmlhelp_basename = 'mmpretraindoc'
+
+# -- Options for LaTeX output ------------------------------------------------
+
+latex_elements = {
+ # The paper size ('letterpaper' or 'a4paper').
+ #
+ # 'papersize': 'letterpaper',
+
+ # The font size ('10pt', '11pt' or '12pt').
+ #
+ # 'pointsize': '10pt',
+
+ # Additional stuff for the LaTeX preamble.
+ #
+ # 'preamble': '',
+
+ # Latex figure (float) alignment
+ #
+ # 'figure_align': 'htbp',
+}
+
+# Grouping the document tree into LaTeX files. List of tuples
+# (source start file, target name, title,
+# author, documentclass [howto, manual, or own class]).
+latex_documents = [
+ (root_doc, 'mmpretrain.tex', 'MMPretrain Documentation', author, 'manual'),
+]
+
+# -- Options for manual page output ------------------------------------------
+
+# One entry per manual page. List of tuples
+# (source start file, name, description, authors, manual section).
+man_pages = [(root_doc, 'mmpretrain', 'MMPretrain Documentation', [author], 1)]
+
+# -- Options for Texinfo output ----------------------------------------------
+
+# Grouping the document tree into Texinfo files. List of tuples
+# (source start file, target name, title, author,
+# dir menu entry, description, category)
+texinfo_documents = [
+ (root_doc, 'mmpretrain', 'MMPretrain Documentation', author, 'mmpretrain',
+ 'OpenMMLab pre-training toolbox and benchmark.', 'Miscellaneous'),
+]
+
+# -- Options for Epub output -------------------------------------------------
+
+# Bibliographic Dublin Core info.
+epub_title = project
+
+# The unique identifier of the text. This can be a ISBN number
+# or the project homepage.
+#
+# epub_identifier = ''
+
+# A unique identification for the text.
+#
+# epub_uid = ''
+
+# A list of files that should not be packed into the epub file.
+epub_exclude_files = ['search.html']
+
+# set priority when building html
+StandaloneHTMLBuilder.supported_image_types = [
+ 'image/svg+xml', 'image/gif', 'image/png', 'image/jpeg'
+]
+
+# -- Extension configuration -------------------------------------------------
+# Ignore >>> when copying code
+copybutton_prompt_text = r'>>> |\.\.\. '
+copybutton_prompt_is_regexp = True
+
+# Auto-generated header anchors
+myst_heading_anchors = 3
+# Enable "colon_fence" extension of myst.
+myst_enable_extensions = ['colon_fence', 'dollarmath']
+
+# Configuration for intersphinx
+intersphinx_mapping = {
+ 'python': ('https://docs.python.org/3', None),
+ 'numpy': ('https://numpy.org/doc/stable', None),
+ 'torch': ('https://pytorch.org/docs/stable/', None),
+ 'mmcv': ('https://mmcv.readthedocs.io/zh_CN/2.x/', None),
+ 'mmengine': ('https://mmengine.readthedocs.io/zh_CN/latest/', None),
+ 'transformers':
+ ('https://huggingface.co/docs/transformers/main/zh/', None),
+}
+napoleon_custom_sections = [
+ # Custom sections for data elements.
+ ('Meta fields', 'params_style'),
+ ('Data fields', 'params_style'),
+]
+
+# Disable docstring inheritance
+autodoc_inherit_docstrings = False
+# Mock some imports when generating the API docs.
+autodoc_mock_imports = ['rich', 'attr', 'einops', 'mat4py']
+# Disable displaying type annotations; they can be very verbose.
+autodoc_typehints = 'none'
+
+# The template used for the 404 (not found) page
+notfound_template = '404.html'
+
+
+def builder_inited_handler(app):
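+ # Run ./stat.py when the builder is initialized, and fail the docs build if the script errors.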
+ if subprocess.run(['./stat.py']).returncode != 0:
+ raise RuntimeError('Failed to run the script `stat.py`.')
+
+
+def setup(app):
+ app.add_config_value('no_underscore_emphasis', False, 'env')
+ app.connect('builder-inited', builder_inited_handler)
diff --git a/docs/zh_CN/device/npu.md b/docs/zh_CN/device/npu.md
new file mode 100644
index 0000000000000000000000000000000000000000..b81c175117be5ce0fc6925ab96d2a5f517b602b4
--- /dev/null
+++ b/docs/zh_CN/device/npu.md
@@ -0,0 +1,41 @@
+# NPU (华为昇腾)
+
+## 使用方法
+
+首先,请参考[此链接](https://mmcv.readthedocs.io/zh_CN/latest/get_started/build.html#npu-mmcv-full)安装带有 NPU 支持的 MMCV,并参考[此链接](https://mmengine.readthedocs.io/en/latest/get_started/installation.html#build-from-source)从源码安装 MMEngine。
+
+使用如下命令,可以利用 8 个 NPU 在机器上训练模型(以 ResNet 为例):
+
+```shell
+bash tools/dist_train.sh configs/resnet/resnet50_8xb32_in1k.py 8
+```
+
+或者,使用如下命令,在一个 NPU 上训练模型(以 ResNet 为例):
+
+```shell
+python tools/train.py configs/resnet/resnet50_8xb32_in1k.py
+```
+
+## 经过验证的模型
+
+| Model | Top-1 (%) | Top-5 (%) | Config | Download |
+| :---------------------------------------------------------: | :-------: | :-------: | :----------------------------------------------------------: | :-------------------------------------------------------------: |
+| [ResNet-50](https://github.com/open-mmlab/mmclassification/blob/1.x/configs/resnet/README.md) | 76.40 | 93.21 | [config](https://github.com/open-mmlab/mmclassification/blob/1.x/configs/resnet/resnet50_8xb32_in1k.py) | [model](<>) \| [log](https://download.openmmlab.com/mmclassification/v1/device/npu/resnet50_8xb32_in1k.log) |
+| [ResNeXt-50-32x4d](https://github.com/open-mmlab/mmclassification/blob/1.x/configs/resnext/README.md) | 77.48 | 93.75 | [config](https://github.com/open-mmlab/mmclassification/blob/1.x/configs/resnext/resnext50-32x4d_8xb32_in1k.py) | [model](<>) \| [log](https://download.openmmlab.com/mmclassification/v1/device/npu/resnext50-32x4d_8xb32_in1k.log) |
+| [HRNet-W18](https://github.com/open-mmlab/mmclassification/blob/master/configs/hrnet/README.md) | 77.06 | 93.57 | [config](https://github.com/open-mmlab/mmclassification/blob/1.x/configs/hrnet/hrnet-w18_4xb32_in1k.py) | [model](<>) \| [log](https://download.openmmlab.com/mmclassification/v1/device/npu/hrnet-w18_4xb32_in1k.log) |
+| [ResNetV1D-152](https://github.com/open-mmlab/mmclassification/blob/1.x/configs/resnet/README.md) | 79.41 | 94.48 | [config](https://github.com/open-mmlab/mmclassification/blob/1.x/configs/resnet/resnetv1d152_8xb32_in1k.py) | [model](<>) \| [log](https://download.openmmlab.com/mmclassification/v1/device/npu/resnetv1d152_8xb32_in1k.log) |
+| [SE-ResNet-50](https://github.com/open-mmlab/mmclassification/blob/1.x/configs/seresnet/README.md) | 77.65 | 93.74 | [config](https://github.com/open-mmlab/mmclassification/blob/1.x/configs/seresnet/seresnet50_8xb32_in1k.py) | [model](<>) \|[log](https://download.openmmlab.com/mmclassification/v1/device/npu/seresnet50_8xb32_in1k.log) |
+| [ShuffleNetV2 1.0x](https://github.com/open-mmlab/mmclassification/blob/1.x/configs/shufflenet_v2/README.md) | 69.52 | 88.79 | [config](https://github.com/open-mmlab/mmclassification/blob/1.x/configs/shufflenet_v2/shufflenet-v2-1x_16xb64_in1k.py) | [model](<>) \| [log](https://download.openmmlab.com/mmclassification/v1/device/npu/shufflenet-v2-1x_16xb64_in1k.log) |
+| [MobileNetV2](https://github.com/open-mmlab/mmclassification/tree/1.x/configs/mobilenet_v2) | 71.74 | 90.28 | [config](https://github.com/open-mmlab/mmclassification/blob/1.x/configs/mobilenet_v2/mobilenet-v2_8xb32_in1k.py) | [model](<>) \| [log](https://download.openmmlab.com/mmclassification/v1/device/npu/mobilenet-v2_8xb32_in1k.log) |
+| [MobileNetV3-Small](https://github.com/open-mmlab/mmclassification/blob/1.x/configs/mobilenet_v3/README.md) | 67.09 | 87.17 | [config](https://github.com/open-mmlab/mmclassification/blob/1.x/configs/mobilenet_v3/mobilenet-v3-small_8xb128_in1k.py) | [model](<>) \| [log](https://download.openmmlab.com/mmclassification/v1/device/npu/mobilenet-v3-small.log) |
+| [\*CSPResNeXt50](https://github.com/open-mmlab/mmclassification/blob/1.x/configs/cspnet/README.md) | 77.25 | 93.46 | [config](https://github.com/open-mmlab/mmclassification/blob/1.x/configs/cspnet/cspresnext50_8xb32_in1k.py) | [model](<>) \| [log](https://download.openmmlab.com/mmclassification/v1/device/npu/cspresnext50_8xb32_in1k.log) |
+| [\*EfficientNet-B4](https://github.com/open-mmlab/mmclassification/blob/1.x/configs/efficientnet/README.md) | 75.73 | 92.91 | [config](https://github.com/open-mmlab/mmclassification/blob/1.x/configs/efficientnet/efficientnet-b4_8xb32_in1k.py) | [model](<>) \| [log](https://download.openmmlab.com/mmclassification/v1/device/npu/efficientnet-b4_8xb32_in1k.log) |
+| [\*\*DenseNet121](https://github.com/open-mmlab/mmclassification/blob/1.x/configs/densenet/README.md) | 72.53 | 90.85 | [config](https://github.com/open-mmlab/mmclassification/blob/1.x/configs/densenet/densenet121_4xb256_in1k.py) | [model](<>) \| [log](https://download.openmmlab.com/mmclassification/v1/device/npu/densenet121_4xb256_in1k.log) |
+
+**注意:**
+
+- 如果没有特别标记,NPU 上的结果与使用 FP32 的 GPU 上的结果相同。
+- (\*) 这些模型的训练精度低于相应模型 README 中给出的结果,这主要是因为 README 中的结果直接来自 timm 训练得到的权重,而此处的结果是按照 mmcls 的配置重新训练得到的。使用相同配置在 GPU 上训练得到的结果与 NPU 上的结果一致。
+- (\*\*) 这个模型的精度略低,因为该 config 是 4 卡的配置,而我们使用 8 卡运行;用户可以自行调整超参数(如学习率)以获得最佳精度,可参考下方示例。
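+
+下面是一段利用 `mmengine.Config` 调整学习率的示意代码:这里假设配置中存在 `optim_wrapper.optimizer.lr` 字段,并以学习率线性缩放为例,输出文件名仅为示例,实际字段与数值请以你使用的配置为准:
+
+```python
+from mmengine.config import Config
+
+# 读取上表中 DenseNet121 的 4 卡配置
+cfg = Config.fromfile('configs/densenet/densenet121_4xb256_in1k.py')
+
+# 按线性缩放规则把学习率从 4 卡放大到 8 卡(仅为示意)
+cfg.optim_wrapper.optimizer.lr *= 8 / 4
+
+# 保存修改后的配置,之后即可用 tools/train.py 进行训练
+cfg.dump('densenet121_8xb256_in1k_npu.py')
+```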
+
+**以上所有模型权重及训练日志均由华为昇腾团队提供**
diff --git a/docs/zh_CN/docutils.conf b/docs/zh_CN/docutils.conf
new file mode 100644
index 0000000000000000000000000000000000000000..0c00c84688701117f231fd0c8ec295fb747b7d8f
--- /dev/null
+++ b/docs/zh_CN/docutils.conf
@@ -0,0 +1,2 @@
+[html writers]
+table_style: colwidths-auto
diff --git a/docs/zh_CN/get_started.md b/docs/zh_CN/get_started.md
new file mode 100644
index 0000000000000000000000000000000000000000..0cf252f1f4f2beef0d6f2879f1a166b0dde5ae0c
--- /dev/null
+++ b/docs/zh_CN/get_started.md
@@ -0,0 +1,163 @@
+# 依赖环境
+
+在本节中,我们将演示如何准备 PyTorch 相关的依赖环境。
+
+MMPretrain 适用于 Linux、Windows 和 macOS。它需要 Python 3.7+、CUDA 10.2+ 和 PyTorch 1.8+。
+
+```{note}
+如果你对配置 PyTorch 环境已经很熟悉,并且已经完成了配置,可以直接进入[下一节](#安装)。
+否则的话,请依照以下步骤完成配置。
+```
+
+**第 1 步** 从[官网](https://docs.conda.io/en/latest/miniconda.html)下载并安装 Miniconda。
+
+**第 2 步** 创建一个 conda 虚拟环境并激活它。
+
+```shell
+conda create --name openmmlab python=3.8 -y
+conda activate openmmlab
+```
+
+**第 3 步** 按照[官方指南](https://pytorch.org/get-started/locally/)安装 PyTorch。例如:
+
+在 GPU 平台:
+
+```shell
+conda install pytorch torchvision -c pytorch
+```
+
+```{warning}
+以上命令会自动安装最新版的 PyTorch 与对应的 cudatoolkit,请检查它们是否与你的环境匹配。
+```
+
+在 CPU 平台:
+
+```shell
+conda install pytorch torchvision cpuonly -c pytorch
+```
+
+# 安装
+
+我们推荐用户按照我们的最佳实践来安装 MMPretrain。但除此之外,如果你想根据
+你的习惯完成安装流程,也可以参见[自定义安装](#自定义安装)一节来获取更多信息。
+
+## 最佳实践
+
+根据具体需求,我们支持两种安装模式:
+
+- [从源码安装(推荐)](#从源码安装):希望基于 MMPretrain 框架开发自己的预训练任务,需要添加新的功能,比如新的模型或是数据集,或者使用我们提供的各种工具。
+- [作为 Python 包安装](#作为-python-包安装):只是希望调用 MMPretrain 的 API 接口,或者在自己的项目中导入 MMPretrain 中的模块。
+
+### 从源码安装
+
+这种情况下,从源码按如下方式安装 mmpretrain:
+
+```shell
+git clone https://github.com/open-mmlab/mmpretrain.git
+cd mmpretrain
+pip install -U openmim && mim install -e .
+```
+
+```{note}
+`"-e"` 表示以可编辑形式安装,这样可以在不重新安装的情况下,让本地修改直接生效。
+```
+
+### 作为 Python 包安装
+
+直接使用 mim 安装即可。
+
+```shell
+pip install -U openmim && mim install "mmpretrain>=1.0.0rc8"
+```
+
+```{note}
+`mim` 是一个轻量级的命令行工具,可以根据 PyTorch 和 CUDA 版本为 OpenMMLab 算法库配置合适的环境。同时它也提供了一些对于深度学习实验很有帮助的功能。
+```
+
+## 安装多模态支持 (可选)
+
+MMPretrain 中的多模态模型需要额外的依赖项,要安装这些依赖项,请在安装过程中添加 `[multimodal]` 参数,如下所示:
+
+```shell
+# 从源码安装
+mim install -e ".[multimodal]"
+
+# 作为 Python 包安装
+mim install "mmpretrain[multimodal]>=1.0.0rc8"
+```
+
+## 验证安装
+
+为了验证 MMPretrain 的安装是否正确,我们提供了一些示例代码来执行模型推理。
+
+如果你是**从源码安装**的 mmpretrain,那么直接运行以下命令进行验证:
+
+```shell
+python demo/image_demo.py demo/demo.JPEG resnet18_8xb32_in1k --device cpu
+```
+
+你可以看到命令行中输出了结果字典,包括 `pred_label`,`pred_score` 和 `pred_class` 三个字段。
+
+如果你是**作为 Python 包安装**,那么可以打开你的 Python 解释器,并粘贴如下代码:
+
+```python
+from mmpretrain import get_model, inference_model
+
+model = get_model('resnet18_8xb32_in1k', device='cpu') # 或者 device='cuda:0'
+inference_model(model, 'demo/demo.JPEG')
+```
+
+你会看到输出一个字典,包含预测的标签、得分及类别名。
+
+```{note}
+以上示例中,`resnet18_8xb32_in1k` 是模型名称。你可以使用 [`mmpretrain.list_models`](mmpretrain.apis.list_models) 接口来
+浏览所有的模型,或者在[模型汇总](./modelzoo_statistics.md)页面进行查找。
+```
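+
+例如,下面的代码片段展示了如何用 `list_models` 按名称筛选模型。这里假设 `list_models` 的第一个参数是用于筛选的模式字符串,具体参数请以 API 文档为准:
+
+```python
+from mmpretrain import list_models
+
+# 列出所有名称中包含 "resnet" 的模型,返回一个模型名称列表
+print(list_models('resnet'))
+```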
+
+## 自定义安装
+
+### CUDA 版本
+
+安装 PyTorch 时,需要指定 CUDA 版本。如果您不清楚选择哪个,请遵循我们的建议:
+
+- 对于 Ampere 架构的 NVIDIA GPU,例如 GeForce 30 series 以及 NVIDIA A100,CUDA 11 是必需的。
+- 对于更早的 NVIDIA GPU,CUDA 11 是向后兼容的,但 CUDA 10.2 能够提供更好的兼容性,也更加轻量。
+
+请确保你的 GPU 驱动版本满足最低的版本需求,参阅[这张表](https://docs.nvidia.com/cuda/cuda-toolkit-release-notes/index.html#cuda-major-component-versions__table-cuda-toolkit-driver-versions)。
+
+```{note}
+如果按照我们的最佳实践进行安装,CUDA 运行时库就足够了,因为我们提供相关 CUDA 代码的预编译,你不需要进行本地编译。
+但如果你希望从源码进行 MMCV 的编译,或是进行其他 CUDA 算子的开发,那么就必须安装完整的 CUDA 工具链,参见
+[NVIDIA 官网](https://developer.nvidia.com/cuda-downloads),另外还需要确保该 CUDA 工具链的版本与 PyTorch 安装时
+的配置相匹配(如用 `conda install` 安装 PyTorch 时指定的 cudatoolkit 版本)。
+```
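+
+安装完成后,可以用下面的 Python 片段快速检查当前 PyTorch 所使用的 CUDA 版本,以及 GPU 是否可用(仅作自检示意):
+
+```python
+import torch
+
+# 打印 PyTorch 版本及其编译时所用的 CUDA 版本
+print(torch.__version__)
+print(torch.version.cuda)        # CPU 版本会输出 None
+# 检查当前环境下 GPU 是否可用
+print(torch.cuda.is_available())
+```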
+
+### 在 CPU 环境中安装
+
+MMPretrain 可以仅在 CPU 环境中安装,在 CPU 模式下,你可以完成训练、测试和模型推理等所有操作。
+
+### 在 Google Colab 中安装
+
+参考 [Colab 教程](https://colab.research.google.com/github/mzr1996/mmclassification-tutorial/blob/master/1.x/MMClassification_tools.ipynb) 安装即可。
+
+### 通过 Docker 使用 MMPretrain
+
+MMPretrain 提供 [Dockerfile](https://github.com/open-mmlab/mmpretrain/blob/main/docker/Dockerfile)
+用于构建镜像。请确保你的 [Docker 版本](https://docs.docker.com/engine/install/) >=19.03。
+
+```shell
+# 构建默认的 PyTorch 1.12.1,CUDA 11.3 版本镜像
+# 如果你希望使用其他版本,请修改 Dockerfile
+docker build -t mmpretrain docker/
+```
+
+用以下命令运行 Docker 镜像:
+
+```shell
+docker run --gpus all --shm-size=8g -it -v {DATA_DIR}:/mmpretrain/data mmpretrain
+```
+
+## 故障解决
+
+如果你在安装过程中遇到了什么问题,请先查阅[常见问题](./notes/faq.md)。如果没有找到解决方法,可以在 GitHub
+上[提出 issue](https://github.com/open-mmlab/mmpretrain/issues/new/choose)。
diff --git a/docs/zh_CN/index.rst b/docs/zh_CN/index.rst
new file mode 100644
index 0000000000000000000000000000000000000000..ca57faacfe3e512cd80f35559e692c2a92c1a36c
--- /dev/null
+++ b/docs/zh_CN/index.rst
@@ -0,0 +1,150 @@
+欢迎来到 MMPretrain 中文教程!
+==========================================
+
+MMPretrain 是一个全新升级的预训练开源算法框架,旨在提供各种强大的预训练主干网络,
+并支持了不同的预训练策略。MMPretrain 源自著名的开源项目
+`MMClassification <https://github.com/open-mmlab/mmclassification>`_
+和 `MMSelfSup <https://github.com/open-mmlab/mmselfsup>`_,并开发了许多令人兴奋的新功能。
+目前,预训练阶段对于视觉识别至关重要,凭借丰富而强大的预训练模型,我们能够改进各种下游视觉任务。
+
+我们的代码库旨在成为一个易于使用和用户友好的代码库,并简化学术研究活动和工程任务。
+我们在以下不同部分中详细介绍了 MMPretrain 的特性和设计。
+
+MMPretrain 上手路线
+-------------------------------
+
+为了用户能够快速上手,我们推荐以下流程:
+
+ - 对于想要使用 MMPretrain 的用户,我们推荐先阅读 开始你的第一步_ 部分来设置环境。
+
+ - 对于一些基础使用,我们建议用户阅读 教程_ 来学习如何使用算法库来获得预训练模型以及在下游任务进行评测。
+
+ - 若您想进行算法的自定义,我们提供了 进阶教程_ 来阐述代码修改的方法和规则。
+
+ - 如果您想找到所期望的预训练模型,您可以浏览 模型库_,其中包含了模型库的总结,以及各类主干网络和预训练算法的介绍。
+
+ - 我们同样提供了 分析工具_ 和 可视化_ 来辅助模型分析。
+
+ - 另外,如果您还有其它问题,欢迎查阅 其他说明_,也许可以找到您想要的答案。
+
+我们始终非常欢迎用户的 PRs 和 Issues 来完善 MMPretrain!
+
+.. _开始你的第一步:
+.. toctree::
+ :maxdepth: 1
+ :caption: 开始你的第一步
+
+ get_started.md
+
+.. _教程:
+.. toctree::
+ :maxdepth: 1
+ :caption: 教程
+
+ user_guides/config.md
+ user_guides/dataset_prepare.md
+ user_guides/inference.md
+ user_guides/train.md
+ user_guides/test.md
+ user_guides/downstream.md
+
+.. _进阶教程:
+.. toctree::
+ :maxdepth: 1
+ :caption: 进阶教程
+
+ advanced_guides/datasets.md
+ advanced_guides/pipeline.md
+ advanced_guides/modules.md
+ advanced_guides/schedule.md
+ advanced_guides/runtime.md
+ advanced_guides/evaluation.md
+ advanced_guides/convention.md
+
+.. _模型库:
+.. toctree::
+ :maxdepth: 1
+ :caption: 模型库
+ :glob:
+
+ modelzoo_statistics.md
+ papers/*
+
+.. _可视化:
+.. toctree::
+ :maxdepth: 1
+ :caption: 可视化
+
+ useful_tools/dataset_visualization.md
+ useful_tools/scheduler_visualization.md
+ useful_tools/cam_visualization.md
+ useful_tools/t-sne_visualization.md
+
+.. _分析工具:
+.. toctree::
+ :maxdepth: 1
+ :caption: 分析工具
+
+ useful_tools/print_config.md
+ useful_tools/verify_dataset.md
+ useful_tools/log_result_analysis.md
+ useful_tools/complexity_analysis.md
+ useful_tools/confusion_matrix.md
+ useful_tools/shape_bias.md
+
+.. toctree::
+ :maxdepth: 1
+ :caption: 部署
+
+ useful_tools/model_serving.md
+
+.. toctree::
+ :maxdepth: 1
+ :caption: 迁移指南
+
+ migration.md
+
+.. toctree::
+ :maxdepth: 1
+ :caption: API 参考文档
+
+ mmpretrain.apis
+ mmpretrain.engine
+ mmpretrain.datasets
+ 数据处理
+ mmpretrain.models
+ mmpretrain.structures
+ mmpretrain.visualization
+ mmpretrain.evaluation
+ mmpretrain.utils
+
+.. _其他说明:
+.. toctree::
+ :maxdepth: 1
+ :caption: 其他说明
+
+ notes/contribution_guide.md
+ notes/projects.md
+ notes/changelog.md
+ notes/faq.md
+ notes/pretrain_custom_dataset.md
+ notes/finetune_custom_dataset.md
+
+.. toctree::
+ :maxdepth: 1
+ :caption: 设备支持
+
+ device/npu.md
+
+.. toctree::
+ :caption: 切换语言
+
+ English <https://mmpretrain.readthedocs.io/en/latest/>
+ 简体中文 <https://mmpretrain.readthedocs.io/zh_CN/latest/>
+
+
+索引与表格
+==================
+
+* :ref:`genindex`
+* :ref:`search`
diff --git a/docs/zh_CN/locales/zh_CN/LC_MESSAGES/api.po b/docs/zh_CN/locales/zh_CN/LC_MESSAGES/api.po
new file mode 100644
index 0000000000000000000000000000000000000000..abfc40d0c3d1b8da4d49f9ecc28b5c53a9e10f83
--- /dev/null
+++ b/docs/zh_CN/locales/zh_CN/LC_MESSAGES/api.po
@@ -0,0 +1,9090 @@
+# SOME DESCRIPTIVE TITLE.
+# Copyright (C) 2020, OpenMMLab
+# This file is distributed under the same license as the MMClassification
+# package.
+# FIRST AUTHOR , 2021.
+#
+msgid ""
+msgstr ""
+"Project-Id-Version: MMClassification\n"
+"Report-Msgid-Bugs-To: \n"
+"POT-Creation-Date: 2022-11-22 08:42+0800\n"
+"PO-Revision-Date: 2022-11-22 15:18+0800\n"
+"Last-Translator: Ma Zerun \n"
+"MIME-Version: 1.0\n"
+"Content-Type: text/plain; charset=utf-8\n"
+"Content-Transfer-Encoding: 8bit\n"
+"Generated-By: Babel 2.9.1\n"
+"Language-Team: \n"
+"Language: zh_CN\n"
+"X-Generator: Poedit 2.3\n"
+
+#: ../../api/apis.rst:7 ../../api/apis.rst:14
+msgid "mmcls.apis"
+msgstr ""
+
+#: ../../api/apis.rst:9
+msgid "These are some high-level APIs for classification tasks."
+msgstr "该包提供了一些用于分类任务的高阶 API"
+
+#: ../../api/apis.rst:17
+msgid "Inference"
+msgstr "推理"
+
+#: ../../api/apis.rst:24::1
+msgid ":py:obj:`init_model `"
+msgstr ""
+
+#: ../../api/apis.rst:24::1 mmcls.apis.inference.init_model:1 of
+msgid "Initialize a classifier from config file."
+msgstr "从配置文件初始化一个分类器"
+
+#: ../../api/apis.rst:24::1
+msgid ":py:obj:`inference_model `"
+msgstr ""
+
+#: ../../api/apis.rst:24::1 mmcls.apis.inference.inference_model:1 of
+msgid "Inference image(s) with the classifier."
+msgstr "使用分类器推理图像"
+
+#: ../../api/data_process.rst:5
+msgid "Data Process"
+msgstr "数据处理"
+
+#: ../../api/data_process.rst:7
+msgid ""
+"In MMClassification, the data process and the dataset is decomposed. The datasets only define how to get "
+"samples' basic information from the file system. These basic information includes the ground-truth label "
+"and raw images data / the paths of images.The data process includes data transforms, data preprocessors and "
+"batch augmentations."
+msgstr ""
+"在 MMClassification 中,数据处理和数据集是解耦的。数据集只定义了如何从文件系统中获取样本的基本信息。这些基本"
+"信息包括分类标签和原始图像数据/图像的路径。完整的数据处理流程包括了数据变换(data transform)、数据预处理器"
+"(data preprocessor)及批量数据增强(batch augmentation)。"
+
+#: ../../api/data_process.rst:13
+msgid ""
+":mod:`Data Transforms `: Transforms includes loading, preprocessing, formatting "
+"and etc."
+msgstr ""
+":mod:`数据变换 `:数据变换包括了数据的加载、部分预处理/增强、数据格式化等操作"
+
+#: ../../api/data_process.rst:14
+msgid ""
+":mod:`Data Preprocessors `: Processes includes collate, "
+"normalization, stacking, channel fliping and etc."
+msgstr ""
+":mod:`数据预处理器 `:主要负责批量数据的收集、归一化、堆叠、通道翻转等"
+"操作。"
+
+#: ../../api/data_process.rst:16
+msgid ""
+":mod:`Batch Augmentations `: Batch augmentation involves multiple "
+"samples, such as Mixup and CutMix."
+msgstr ""
+":mod:`批量数据增强 `:批量数据增强是数据预处理器的功能之一,负责处理涉及"
+"多个样本的数据增强操作,例如 Mixup 和 CutMix。"
+
+#: ../../api/data_process.rst:21
+msgid "Data Transforms"
+msgstr "数据变换"
+
+#: ../../api/data_process.rst:23
+msgid ""
+"To prepare the inputs data, we need to do some transforms on these basic information. These transforms "
+"includes loading, preprocessing and formatting. And a series of data transforms makes up a data pipeline. "
+"Therefore, you can find the a ``pipeline`` argument in the configs of dataset, for example:"
+msgstr ""
+"为了准备输入数据,我们需要对数据集中保存的基本信息做一些变换。这些变换包括数据加载、部分预处理和增强、格式"
+"化。一系列的数据变换组成了数据流水线(data pipeline)。因此,在数据集的配置参数中通常存在一个 ``pipeline`` "
+"参数,例如:"
+
+#: ../../api/data_process.rst:46
+msgid ""
+"Every item of a pipeline list is one of the following data transforms class. And if you want to add a "
+"custom data transformation class, the tutorial :doc:`Custom Data Pipelines ` "
+"will help you."
+msgstr ""
+"``pipeline`` 列表中的每一项都是以下数据变换类之一。如果您想添加自定义数据变换类,可以参考 :doc:`自定义数据流"
+"水线教程 `。"
+
+#: ../../api/data_process.rst:54
+msgid "Processing and Augmentation"
+msgstr "组合式增强"
+
+#: ../../api/data_process.rst:70::1
+msgid ":py:obj:`Albumentations `"
+msgstr ""
+
+#: ../../api/data_process.rst:70::1 mmcls.datasets.transforms.processing.Albumentations:1 of
+msgid "Wrapper to use augmentation from albumentations library."
+msgstr "使用 Albumentations 库进行数据变换的封装类"
+
+#: ../../api/data_process.rst:70::1
+msgid ":py:obj:`ColorJitter `"
+msgstr ""
+
+#: ../../api/data_process.rst:70::1 mmcls.datasets.transforms.processing.ColorJitter:1 of
+msgid "Randomly change the brightness, contrast and saturation of an image."
+msgstr "随机改变图像的亮度、对比度和饱和度"
+
+#: ../../api/data_process.rst:70::1
+msgid ":py:obj:`EfficientNetCenterCrop `"
+msgstr ""
+
+#: ../../api/data_process.rst:70::1 mmcls.datasets.transforms.processing.EfficientNetCenterCrop:1
+#: of
+msgid "EfficientNet style center crop."
+msgstr "EfficientNet 风格的中心裁剪"
+
+#: ../../api/data_process.rst:70::1
+msgid ":py:obj:`EfficientNetRandomCrop `"
+msgstr ""
+
+#: ../../api/data_process.rst:70::1 mmcls.datasets.transforms.processing.EfficientNetRandomCrop:1
+#: of
+msgid "EfficientNet style RandomResizedCrop."
+msgstr "EfficientNet 风格的随机缩放裁剪"
+
+#: ../../api/data_process.rst:70::1
+msgid ":py:obj:`Lighting `"
+msgstr ""
+
+#: ../../api/data_process.rst:70::1 mmcls.datasets.transforms.processing.Lighting:1 of
+msgid "Adjust images lighting using AlexNet-style PCA jitter."
+msgstr "使用 AlexNet 风格的 PCA 抖动随机调整图像照明"
+
+#: ../../api/data_process.rst:70::1
+msgid ":py:obj:`RandomCrop `"
+msgstr ""
+
+#: ../../api/data_process.rst:70::1 mmcls.datasets.transforms.processing.RandomCrop:1 of
+msgid "Crop the given Image at a random location."
+msgstr "在随机位置裁剪给定图像"
+
+#: ../../api/data_process.rst:70::1
+msgid ":py:obj:`RandomErasing `"
+msgstr ""
+
+#: ../../api/data_process.rst:70::1 mmcls.datasets.transforms.processing.RandomErasing:1 of
+msgid "Randomly selects a rectangle region in an image and erase pixels."
+msgstr "在图像中随机选择一个矩形区域并擦除像素"
+
+#: ../../api/data_process.rst:70::1
+msgid ":py:obj:`RandomResizedCrop `"
+msgstr ""
+
+#: ../../api/data_process.rst:70::1 mmcls.datasets.transforms.processing.RandomResizedCrop:1 of
+msgid "Crop the given image to random scale and aspect ratio."
+msgstr "将给定图像按照随机尺寸和纵横比进行裁剪"
+
+#: ../../api/data_process.rst:70::1
+msgid ":py:obj:`ResizeEdge `"
+msgstr ""
+
+#: ../../api/data_process.rst:70::1 mmcls.datasets.transforms.processing.ResizeEdge:1 of
+msgid "Resize images along the specified edge."
+msgstr "按照指定边长调整图像尺寸"
+
+#: ../../api/data_process.rst:72
+msgid "Composed Augmentation"
+msgstr "组合式增强"
+
+#: ../../api/data_process.rst:73
+msgid ""
+"Composed augmentation is a kind of methods which compose a series of data augmentation transforms, such as "
+"``AutoAugment`` and ``RandAugment``."
+msgstr ""
+"组合式增强将一系列数据增强方法组合在一起,实现对样本的整体增强,例如 ``AutoAugment`` 和 ``RandAugment``"
+
+#: ../../api/data_process.rst:83::1
+msgid ":py:obj:`AutoAugment `"
+msgstr ""
+
+#: ../../api/data_process.rst:83::1 mmcls.datasets.transforms.auto_augment.AutoAugment:1 of
+msgid "Auto augmentation."
+msgstr ""
+
+#: ../../api/data_process.rst:83::1
+msgid ":py:obj:`RandAugment `"
+msgstr ""
+
+#: ../../api/data_process.rst:83::1 mmcls.datasets.transforms.auto_augment.RandAugment:1 of
+msgid "Random augmentation."
+msgstr ""
+
+#: ../../api/data_process.rst:84
+msgid ""
+"To specify the augmentation combination (The ``policies`` argument), you can use string to specify from "
+"some preset policies."
+msgstr "为了指定增强组合的策略(即上述变换中的 ``policies`` 参数),你可以使用字符串从一系列预设策略中指定。"
+
+#: ../../api/data_process.rst:91
+msgid "Preset policy"
+msgstr "预设策略"
+
+#: ../../api/data_process.rst:92
+msgid "Use for"
+msgstr "用于"
+
+#: ../../api/data_process.rst:93
+msgid "Description"
+msgstr "说明"
+
+#: ../../api/data_process.rst:94
+msgid "\"imagenet\""
+msgstr ""
+
+#: ../../api/data_process.rst:95
+msgid ":class:`AutoAugment`"
+msgstr ""
+
+#: ../../api/data_process.rst:96
+msgid "Policy for ImageNet, come from `DeepVoltaire/AutoAugment`_"
+msgstr "用于 ImageNet 数据集的增强组合,来自 `DeepVoltaire/AutoAugment`_ 仓库"
+
+#: ../../api/data_process.rst:97
+msgid "\"timm_increasing\""
+msgstr ""
+
+#: ../../api/data_process.rst:98
+msgid ":class:`RandAugment`"
+msgstr ""
+
+#: ../../api/data_process.rst:99
+msgid "The ``_RAND_INCREASING_TRANSFORMS`` policy from `timm`_"
+msgstr "`timm`_ 仓库中的 ``_RAND_INCREASING_TRANSFORMS`` 增强组合"
+
+#: ../../api/data_process.rst:104
+msgid "And you can also configure a group of policies manually by selecting from the below table."
+msgstr "你还可以根据下表手动配置一组策略。"
+
+#: ../../api/data_process.rst:126::1
+msgid ":py:obj:`AutoContrast `"
+msgstr ""
+
+#: ../../api/data_process.rst:126::1 mmcls.datasets.transforms.auto_augment.AutoContrast:1 of
+msgid "Auto adjust image contrast."
+msgstr "自动调整图像对比度"
+
+#: ../../api/data_process.rst:126::1
+msgid ":py:obj:`Brightness `"
+msgstr ""
+
+#: ../../api/data_process.rst:126::1 mmcls.datasets.transforms.auto_augment.Brightness:1 of
+msgid "Adjust images brightness."
+msgstr "调整图像亮度"
+
+#: ../../api/data_process.rst:126::1
+msgid ":py:obj:`ColorTransform `"
+msgstr ""
+
+#: ../../api/data_process.rst:126::1 mmcls.datasets.transforms.auto_augment.ColorTransform:1 of
+msgid "Adjust images color balance."
+msgstr "调整图像色彩平衡"
+
+#: ../../api/data_process.rst:126::1
+msgid ":py:obj:`Contrast `"
+msgstr ""
+
+#: ../../api/data_process.rst:126::1 mmcls.datasets.transforms.auto_augment.Contrast:1 of
+msgid "Adjust images contrast."
+msgstr "改变图像对比度"
+
+#: ../../api/data_process.rst:126::1
+msgid ":py:obj:`Cutout `"
+msgstr ""
+
+#: ../../api/data_process.rst:126::1 mmcls.datasets.transforms.auto_augment.Cutout:1 of
+msgid "Cutout images."
+msgstr "擦除部分图像区域"
+
+#: ../../api/data_process.rst:126::1
+msgid ":py:obj:`Equalize `"
+msgstr ""
+
+#: ../../api/data_process.rst:126::1 mmcls.datasets.transforms.auto_augment.Equalize:1 of
+msgid "Equalize the image histogram."
+msgstr "均衡化图像直方图"
+
+#: ../../api/data_process.rst:126::1
+msgid ":py:obj:`Invert `"
+msgstr ""
+
+#: ../../api/data_process.rst:126::1 mmcls.datasets.transforms.auto_augment.Invert:1 of
+msgid "Invert images."
+msgstr "反转图像色阶"
+
+#: ../../api/data_process.rst:126::1
+msgid ":py:obj:`Posterize `"
+msgstr ""
+
+#: ../../api/data_process.rst:126::1 mmcls.datasets.transforms.auto_augment.Posterize:1 of
+msgid "Posterize images (reduce the number of bits for each color channel)."
+msgstr "图像像素化(降低各色彩通道的比特数)"
+
+#: ../../api/data_process.rst:126::1
+msgid ":py:obj:`Rotate `"
+msgstr ""
+
+#: ../../api/data_process.rst:126::1 mmcls.datasets.transforms.auto_augment.Rotate:1 of
+msgid "Rotate images."
+msgstr "旋转图像"
+
+#: ../../api/data_process.rst:126::1
+msgid ":py:obj:`Sharpness `"
+msgstr ""
+
+#: ../../api/data_process.rst:126::1 mmcls.datasets.transforms.auto_augment.Sharpness:1 of
+msgid "Adjust images sharpness."
+msgstr "改变图像锐度"
+
+#: ../../api/data_process.rst:126::1
+msgid ":py:obj:`Shear `"
+msgstr ""
+
+#: ../../api/data_process.rst:126::1 mmcls.datasets.transforms.auto_augment.Shear:1 of
+msgid "Shear images."
+msgstr "图像切变"
+
+#: ../../api/data_process.rst:126::1
+msgid ":py:obj:`Solarize `"
+msgstr ""
+
+#: ../../api/data_process.rst:126::1 mmcls.datasets.transforms.auto_augment.Solarize:1 of
+msgid "Solarize images (invert all pixel values above a threshold)."
+msgstr "图像日光化(反转高于某一阈值的所有图像色阶)"
+
+#: ../../api/data_process.rst:126::1
+msgid ":py:obj:`SolarizeAdd `"
+msgstr ""
+
+#: ../../api/data_process.rst:126::1 mmcls.datasets.transforms.auto_augment.SolarizeAdd:1 of
+msgid "SolarizeAdd images (add a certain value to pixels below a threshold)."
+msgstr "图像过曝(为低于某一阈值的所有色阶增加一个固定值)"
+
+#: ../../api/data_process.rst:126::1
+msgid ":py:obj:`Translate `"
+msgstr ""
+
+#: ../../api/data_process.rst:126::1 mmcls.datasets.transforms.auto_augment.Translate:1 of
+msgid "Translate images."
+msgstr "平移图像"
+
+#: ../../api/data_process.rst:126::1
+msgid ":py:obj:`BaseAugTransform `"
+msgstr ""
+
+#: ../../api/data_process.rst:126::1 mmcls.datasets.transforms.auto_augment.BaseAugTransform:1 of
+msgid "The base class of augmentation transform for RandAugment."
+msgstr "用于组合式增强的数据变换基类"
+
+#: ../../api/data_process.rst:128
+msgid "Formatting"
+msgstr "格式化"
+
+#: ../../api/data_process.rst:141::1
+msgid ":py:obj:`Collect `"
+msgstr ""
+
+#: ../../api/data_process.rst:141::1 mmcls.datasets.transforms.formatting.Collect:1 of
+msgid "Collect and only reserve the specified fields."
+msgstr "收集并仅保留指定字段的数据"
+
+#: ../../api/data_process.rst:141::1
+msgid ":py:obj:`PackClsInputs `"
+msgstr ""
+
+#: ../../api/data_process.rst:141::1 mmcls.datasets.transforms.formatting.PackClsInputs:1 of
+msgid "Pack the inputs data for the classification."
+msgstr "将输入数据整理成为用于分类任务的数据格式。"
+
+#: ../../api/data_process.rst:141::1
+msgid ":py:obj:`ToNumpy `"
+msgstr ""
+
+#: ../../api/data_process.rst:141::1 mmcls.datasets.transforms.formatting.ToNumpy:1 of
+msgid "Convert object to :obj:`numpy.ndarray`."
+msgstr "将对象转变为 :obj:`numpy.ndarray`"
+
+#: ../../api/data_process.rst:141::1
+msgid ":py:obj:`ToPIL `"
+msgstr ""
+
+#: ../../api/data_process.rst:141::1 mmcls.datasets.transforms.formatting.ToPIL:1 of
+msgid "Convert the image from OpenCV format to :obj:`PIL.Image.Image`."
+msgstr "将图片从 OpenCV 格式转换为 :obj:`PIL.Image.Image` 格式"
+
+#: ../../api/data_process.rst:141::1
+msgid ":py:obj:`Transpose `"
+msgstr ""
+
+#: ../../api/data_process.rst:141::1 mmcls.datasets.transforms.formatting.Transpose:1 of
+msgid "Transpose numpy array."
+msgstr "转置 NumPy 数组"
+
+#: ../../api/data_process.rst:143
+msgid "MMCV transforms"
+msgstr "MMCV 中的数据变换"
+
+#: ../../api/data_process.rst:145
+msgid ""
+"We also provides many transforms in MMCV. You can use them directly in the config files. Here are some "
+"frequently used transforms, and the whole transforms list can be found in :external+mmcv:doc:`api/"
+"transforms`."
+msgstr ""
+"我们还在 MMCV 中提供了很多数据转换类。你可以在配置文件中直接使用它们。这里我们列举了一些常用的数据变换类,完"
+"整的数据变换类列表可以在 :external+mmcv:doc:`api/transforms` 中找到。"
+
+#: ../../api/data_process.rst:150
+msgid ":external:class:`~mmcv.transforms.LoadImageFromFile`"
+msgstr ""
+
+#: ../../api/data_process.rst:151
+msgid "Load an image from file."
+msgstr "从图片路径加载图片"
+
+#: ../../api/data_process.rst:152
+msgid ":external:class:`~mmcv.transforms.Resize`"
+msgstr ""
+
+#: ../../api/data_process.rst:153
+msgid "Resize images & bbox & seg & keypoints."
+msgstr "缩放图像、bbox、分割图、关键点等"
+
+#: ../../api/data_process.rst:154
+msgid ":external:class:`~mmcv.transforms.RandomResize`"
+msgstr ""
+
+#: ../../api/data_process.rst:155
+msgid "Random resize images & bbox & keypoints."
+msgstr "随机缩放图像、bbox、关键点等"
+
+#: ../../api/data_process.rst:156
+msgid ":external:class:`~mmcv.transforms.RandomFlip`"
+msgstr ""
+
+#: ../../api/data_process.rst:157
+msgid "Flip the image & bbox & keypoints & segmentation map."
+msgstr "随机翻转图像、bbox、关键点等"
+
+#: ../../api/data_process.rst:158
+msgid ":external:class:`~mmcv.transforms.RandomGrayscale`"
+msgstr ""
+
+#: ../../api/data_process.rst:159
+msgid "Randomly convert image to grayscale with a probability."
+msgstr "随机灰度化图像"
+
+#: ../../api/data_process.rst:160
+msgid ":external:class:`~mmcv.transforms.CenterCrop`"
+msgstr ""
+
+#: ../../api/data_process.rst:161
+msgid ""
+"Crop the center of the image, segmentation masks, bounding boxes and key points. If the crop area exceeds "
+"the original image and ``auto_pad`` is True, the original image will be padded before cropping."
+msgstr ""
+"裁剪一张图像的中心区域(同时处理分割图、bbox、关键点等)。如果裁剪尺寸超出原图区域,并且指定了 "
+"``auto_pad=True``,则会在裁剪之前扩充原图至合适大小"
+
+#: ../../api/data_process.rst:162
+msgid ":external:class:`~mmcv.transforms.Normalize`"
+msgstr ""
+
+#: ../../api/data_process.rst:163
+msgid "Normalize the image."
+msgstr "归一化图像"
+
+#: ../../api/data_process.rst:164
+msgid ":external:class:`~mmcv.transforms.Compose`"
+msgstr ""
+
+#: ../../api/data_process.rst:165
+msgid "Compose multiple transforms sequentially."
+msgstr "顺序组合一系列数据变换"
+
+#: ../../api/data_process.rst:170
+msgid "Data Preprocessors"
+msgstr "数据预处理器"
+
+#: ../../api/data_process.rst:172
+msgid ""
+"The data preprocessor is also a component to process the data before feeding data to the neural network. "
+"Comparing with the data transforms, the data preprocessor is a module of the classifier, and it takes a "
+"batch of data to process, which means it can use GPU and batch to accelebrate the processing."
+msgstr ""
+"数据预处理器也是在数据进入神经网络之前,对数据进行处理的组件。与数据变换相比,数据预处理器是模型的一个模"
+"块,并且可以获得一个批次的数据进行处理,这意味着它可以使用模型所在的设备(如 GPU),并利用批量处理,实现加"
+"速。"
+
+#: ../../api/data_process.rst:176
+msgid "The default data preprocessor in MMClassification could do the pre-processing like following:"
+msgstr "MMClassification 中使用的默认的数据预处理器可以进行以下操作:"
+
+#: ../../api/data_process.rst:178
+msgid "Move data to the target device."
+msgstr "将数据移动到模型所在的设备"
+
+#: ../../api/data_process.rst:179
+msgid "Pad inputs to the maximum size of current batch."
+msgstr "将输入填充至当前批次的最大尺寸"
+
+#: ../../api/data_process.rst:180
+msgid "Stack inputs to a batch."
+msgstr "将一系列输入的 tensor 组成 batch"
+
+#: ../../api/data_process.rst:181 mmcls.models.utils.data_preprocessor.ClsDataPreprocessor:16 of
+msgid "Convert inputs from bgr to rgb if the shape of input is (3, H, W)."
+msgstr "如果输入的 tensor 形状为 (3, H, W),则可以执行 BGR 到 RGB 的通道转换"
+
+#: ../../api/data_process.rst:182 mmcls.models.utils.data_preprocessor.ClsDataPreprocessor:17 of
+msgid "Normalize image with defined std and mean."
+msgstr "根据给定的均值和方差对图像进行归一化"
+
+#: ../../api/data_process.rst:183
+msgid "Do batch augmentations like Mixup and CutMix during training."
+msgstr "在训练时进行批量数据增强,如 Mixup 和 CutMix"
+
+#: ../../api/data_process.rst:185
+msgid ""
+"You can configure the data preprocessor by the ``data_preprocessor`` field or ``model.data_preprocessor`` "
+"field in the config file. Typical usages are as below:"
+msgstr ""
+"你可以在配置文件的 ``data_preprocessor`` 字段,或是 ``model.data_preprocessor`` 字段对数据预处理器进行配置。"
+"一个典型的用法如下:"
+
+#: ../../api/data_process.rst:196
+msgid "Or define in ``model.data_preprocessor`` as following:"
+msgstr "或者在 ``model.data_preprocessor`` 字段配置如下:"
+
+#: ../../api/data_process.rst:211
+msgid "Note that the ``model.data_preprocessor`` has higher priority than ``data_preprocessor``."
+msgstr "请注意如果在两处均进行了配置,``model.data_preprocessor`` 拥有更高的优先级。"
+
+#: ../../api/data_process.rst:219::1
+msgid ":py:obj:`ClsDataPreprocessor `"
+msgstr ""
+
+#: ../../api/data_process.rst:219::1 mmcls.models.utils.data_preprocessor.ClsDataPreprocessor:1
+#: of
+msgid "Image pre-processor for classification tasks."
+msgstr "用于分类任务的图像预处理器"
+
+#: ../../api/data_process.rst:223
+msgid "Batch Augmentations"
+msgstr "批量数据增强"
+
+#: ../../api/data_process.rst:225
+msgid ""
+"The batch augmentation is a component of data preprocessors. It involves multiple samples and mix them in "
+"some way, such as Mixup and CutMix."
+msgstr ""
+"批量数据增强是数据预处理器的一个功能。它可以利用一个批次的多个样本,以某种方式进行混合增强,如 Mixup 和 "
+"CutMix。"
+
+#: ../../api/data_process.rst:227
+msgid ""
+"These augmentations are usually only used during training, therefore, we use the ``model.train_cfg`` field "
+"to configure them in config files."
+msgstr "这些数据增强只会在训练过程中生效,因此,我们使用 ``model.train_cfg`` 字段来配置这些功能。"
+
+#: ../../api/data_process.rst:241
+msgid "You can also specify the probabilities of every batch augmentation by the ``probs`` field."
+msgstr "你也可以通过 ``probs`` 字段指定每一个批量数据增强的概率。"
+
+#: ../../api/data_process.rst:255
+msgid "Here is a list of batch augmentations can be used in MMClassification."
+msgstr "这里是 MMClassification 中支持的所有批量数据增强列表。"
+
+#: ../../api/data_process.rst:264::1
+msgid ":py:obj:`Mixup `"
+msgstr ""
+
+#: ../../api/data_process.rst:264::1 mmcls.models.utils.batch_augments.mixup.Mixup:1 of
+msgid "Mixup batch augmentation."
+msgstr ""
+
+#: ../../api/data_process.rst:264::1
+msgid ":py:obj:`CutMix `"
+msgstr ""
+
+#: ../../api/data_process.rst:264::1 mmcls.models.utils.batch_augments.cutmix.CutMix:1 of
+msgid "CutMix batch agumentation."
+msgstr ""
+
+#: ../../api/data_process.rst:264::1
+msgid ":py:obj:`ResizeMix `"
+msgstr ""
+
+#: ../../api/data_process.rst:264::1 mmcls.models.utils.batch_augments.resizemix.ResizeMix:1 of
+msgid "ResizeMix Random Paste layer for a batch of data."
+msgstr ""
+
+#: ../../api/datasets.rst:7 ../../api/datasets.rst:14
+msgid "mmcls.datasets"
+msgstr ""
+
+#: ../../api/datasets.rst:9
+msgid ""
+"The ``datasets`` package contains several usual datasets for image classification tasks and some dataset "
+"wrappers."
+msgstr "``datasets`` 包中包含了分类任务中常用的数据集,以及一些数据集封装。"
+
+#: ../../api/datasets.rst:17
+msgid "Custom Dataset"
+msgstr ""
+
+#: mmcls.datasets.custom.CustomDataset:1 of
+msgid "Custom dataset for classification."
+msgstr ""
+
+#: mmcls.datasets.custom.CustomDataset:3 of
+msgid "The dataset supports two kinds of annotation format."
+msgstr ""
+
+#: mmcls.datasets.custom.CustomDataset:5 of
+msgid "An annotation file is provided, and each line indicates a sample:"
+msgstr ""
+
+#: mmcls.datasets.custom.CustomDataset:7 of
+msgid "The sample files: ::"
+msgstr ""
+
+#: mmcls.datasets.custom.CustomDataset:19 of
+msgid ""
+"The annotation file (the first column is the image path and the second column is the index of category): ::"
+msgstr ""
+
+#: mmcls.datasets.custom.CustomDataset:28 of
+msgid "Please specify the name of categories by the argument ``classes`` or ``metainfo``."
+msgstr ""
+
+#: mmcls.datasets.custom.CustomDataset:31 of
+msgid "The samples are arranged in the specific way: ::"
+msgstr ""
+
+#: mmcls.datasets.custom.CustomDataset:45 of
+msgid ""
+"If the ``ann_file`` is specified, the dataset will be generated by the first way, otherwise, try the second "
+"way."
+msgstr ""
+
+#: mmcls.apis.inference.inference_model mmcls.apis.inference.init_model
+#: mmcls.datasets.base_dataset.BaseDataset mmcls.datasets.cifar.CIFAR10 mmcls.datasets.cifar.CIFAR100
+#: mmcls.datasets.cub.CUB mmcls.datasets.custom.CustomDataset mmcls.datasets.dataset_wrappers.KFoldDataset
+#: mmcls.datasets.imagenet.ImageNet mmcls.datasets.imagenet.ImageNet21k mmcls.datasets.mnist.FashionMNIST
+#: mmcls.datasets.mnist.MNIST mmcls.datasets.multi_label.MultiLabelDataset
+#: mmcls.datasets.transforms.auto_augment.AutoAugment mmcls.datasets.transforms.auto_augment.AutoContrast
+#: mmcls.datasets.transforms.auto_augment.BaseAugTransform mmcls.datasets.transforms.auto_augment.Brightness
+#: mmcls.datasets.transforms.auto_augment.ColorTransform mmcls.datasets.transforms.auto_augment.Contrast
+#: mmcls.datasets.transforms.auto_augment.Cutout mmcls.datasets.transforms.auto_augment.Equalize
+#: mmcls.datasets.transforms.auto_augment.Invert mmcls.datasets.transforms.auto_augment.Posterize
+#: mmcls.datasets.transforms.auto_augment.RandAugment mmcls.datasets.transforms.auto_augment.Rotate
+#: mmcls.datasets.transforms.auto_augment.Sharpness mmcls.datasets.transforms.auto_augment.Shear
+#: mmcls.datasets.transforms.auto_augment.Solarize mmcls.datasets.transforms.auto_augment.SolarizeAdd
+#: mmcls.datasets.transforms.auto_augment.Translate mmcls.datasets.transforms.formatting.Collect
+#: mmcls.datasets.transforms.formatting.PackClsInputs mmcls.datasets.transforms.formatting.ToNumpy
+#: mmcls.datasets.transforms.formatting.Transpose mmcls.datasets.transforms.processing.Albumentations
+#: mmcls.datasets.transforms.processing.Albumentations.transform
+#: mmcls.datasets.transforms.processing.ColorJitter mmcls.datasets.transforms.processing.ColorJitter.transform
+#: mmcls.datasets.transforms.processing.EfficientNetCenterCrop
+#: mmcls.datasets.transforms.processing.EfficientNetCenterCrop.transform
+#: mmcls.datasets.transforms.processing.EfficientNetRandomCrop mmcls.datasets.transforms.processing.Lighting
+#: mmcls.datasets.transforms.processing.Lighting.transform mmcls.datasets.transforms.processing.RandomCrop
+#: mmcls.datasets.transforms.processing.RandomCrop.transform
+#: mmcls.datasets.transforms.processing.RandomErasing
+#: mmcls.datasets.transforms.processing.RandomErasing.transform
+#: mmcls.datasets.transforms.processing.RandomResizedCrop
+#: mmcls.datasets.transforms.processing.RandomResizedCrop.transform
+#: mmcls.datasets.transforms.processing.ResizeEdge mmcls.datasets.transforms.processing.ResizeEdge.transform
+#: mmcls.datasets.voc.VOC mmcls.engine.hooks.class_num_check_hook.ClassNumCheckHook.before_test
+#: mmcls.engine.hooks.class_num_check_hook.ClassNumCheckHook.before_train
+#: mmcls.engine.hooks.class_num_check_hook.ClassNumCheckHook.before_val
+#: mmcls.engine.hooks.margin_head_hooks.SetAdaptiveMarginsHook
+#: mmcls.engine.hooks.margin_head_hooks.SetAdaptiveMarginsHook.before_train
+#: mmcls.engine.hooks.precise_bn_hook.PreciseBNHook
+#: mmcls.engine.hooks.precise_bn_hook.PreciseBNHook.after_train_epoch
+#: mmcls.engine.hooks.precise_bn_hook.PreciseBNHook.after_train_iter
+#: mmcls.engine.hooks.visualization_hook.VisualizationHook
+#: mmcls.engine.hooks.visualization_hook.VisualizationHook.after_test_iter
+#: mmcls.engine.hooks.visualization_hook.VisualizationHook.after_val_iter mmcls.engine.optimizers.lamb.Lamb
+#: mmcls.engine.optimizers.lamb.Lamb.step mmcls.evaluation.metrics.multi_label.AveragePrecision
+#: mmcls.evaluation.metrics.multi_label.MultiLabelMetric mmcls.evaluation.metrics.single_label.Accuracy
+#: mmcls.evaluation.metrics.single_label.Accuracy.calculate
+#: mmcls.evaluation.metrics.single_label.Accuracy.compute_metrics
+#: mmcls.evaluation.metrics.single_label.Accuracy.process
+#: mmcls.evaluation.metrics.single_label.SingleLabelMetric
+#: mmcls.evaluation.metrics.single_label.SingleLabelMetric.calculate
+#: mmcls.evaluation.metrics.single_label.SingleLabelMetric.compute_metrics
+#: mmcls.evaluation.metrics.single_label.SingleLabelMetric.process
+#: mmcls.evaluation.metrics.voc_multi_label.VOCAveragePrecision
+#: mmcls.evaluation.metrics.voc_multi_label.VOCMultiLabelMetric mmcls.models.backbones.alexnet.AlexNet
+#: mmcls.models.backbones.conformer.Conformer mmcls.models.backbones.convmixer.ConvMixer
+#: mmcls.models.backbones.convnext.ConvNeXt mmcls.models.backbones.cspnet.CSPDarkNet
+#: mmcls.models.backbones.cspnet.CSPNet mmcls.models.backbones.cspnet.CSPResNeXt
+#: mmcls.models.backbones.cspnet.CSPResNet mmcls.models.backbones.davit.DaViT
+#: mmcls.models.backbones.deit.DistilledVisionTransformer mmcls.models.backbones.deit3.DeiT3
+#: mmcls.models.backbones.densenet.DenseNet mmcls.models.backbones.edgenext.EdgeNeXt
+#: mmcls.models.backbones.efficientformer.EfficientFormer mmcls.models.backbones.efficientnet.EfficientNet
+#: mmcls.models.backbones.hornet.HorNet mmcls.models.backbones.hrnet.HRNet
+#: mmcls.models.backbones.inception_v3.InceptionV3 mmcls.models.backbones.lenet.LeNet5
+#: mmcls.models.backbones.mlp_mixer.MlpMixer mmcls.models.backbones.mobilenet_v2.MobileNetV2
+#: mmcls.models.backbones.mobilenet_v2.MobileNetV2.make_layer mmcls.models.backbones.mobilenet_v3.MobileNetV3
+#: mmcls.models.backbones.mobileone.MobileOne mmcls.models.backbones.mobilevit.MobileViT
+#: mmcls.models.backbones.mobilevit.MobileViT.make_mobilenetv2_layer
+#: mmcls.models.backbones.mobilevit.MobileViT.make_mobilevit_layer mmcls.models.backbones.mvit.MViT
+#: mmcls.models.backbones.poolformer.PoolFormer mmcls.models.backbones.regnet.RegNet
+#: mmcls.models.backbones.regnet.RegNet.adjust_width_group
+#: mmcls.models.backbones.regnet.RegNet.generate_regnet
+#: mmcls.models.backbones.regnet.RegNet.get_stages_from_blocks
+#: mmcls.models.backbones.regnet.RegNet.quantize_float mmcls.models.backbones.replknet.RepLKNet
+#: mmcls.models.backbones.repmlp.RepMLPNet mmcls.models.backbones.repvgg.RepVGG
+#: mmcls.models.backbones.res2net.Res2Net mmcls.models.backbones.resnest.ResNeSt
+#: mmcls.models.backbones.resnet.ResNet mmcls.models.backbones.resnet_cifar.ResNet_CIFAR
+#: mmcls.models.backbones.resnext.ResNeXt mmcls.models.backbones.seresnet.SEResNet
+#: mmcls.models.backbones.seresnext.SEResNeXt mmcls.models.backbones.shufflenet_v1.ShuffleNetV1
+#: mmcls.models.backbones.shufflenet_v1.ShuffleNetV1.make_layer
+#: mmcls.models.backbones.shufflenet_v2.ShuffleNetV2 mmcls.models.backbones.swin_transformer.SwinTransformer
+#: mmcls.models.backbones.swin_transformer_v2.SwinTransformerV2 mmcls.models.backbones.t2t_vit.T2T_ViT
+#: mmcls.models.backbones.timm_backbone.TIMMBackbone mmcls.models.backbones.tnt.TNT
+#: mmcls.models.backbones.twins.PCPVT mmcls.models.backbones.twins.SVT mmcls.models.backbones.van.VAN
+#: mmcls.models.backbones.vgg.VGG mmcls.models.backbones.vision_transformer.VisionTransformer
+#: mmcls.models.classifiers.base.BaseClassifier mmcls.models.classifiers.base.BaseClassifier.extract_feat
+#: mmcls.models.classifiers.base.BaseClassifier.extract_feats
+#: mmcls.models.classifiers.base.BaseClassifier.forward
+#: mmcls.models.classifiers.hugging_face.HuggingFaceClassifier
+#: mmcls.models.classifiers.hugging_face.HuggingFaceClassifier.loss
+#: mmcls.models.classifiers.hugging_face.HuggingFaceClassifier.predict
+#: mmcls.models.classifiers.image.ImageClassifier mmcls.models.classifiers.image.ImageClassifier.extract_feat
+#: mmcls.models.classifiers.image.ImageClassifier.forward mmcls.models.classifiers.image.ImageClassifier.loss
+#: mmcls.models.classifiers.image.ImageClassifier.predict mmcls.models.classifiers.timm.TimmClassifier
+#: mmcls.models.classifiers.timm.TimmClassifier.loss mmcls.models.classifiers.timm.TimmClassifier.predict
+#: mmcls.models.heads.cls_head.ClsHead mmcls.models.heads.cls_head.ClsHead.loss
+#: mmcls.models.heads.cls_head.ClsHead.predict mmcls.models.heads.conformer_head.ConformerHead
+#: mmcls.models.heads.conformer_head.ConformerHead.predict mmcls.models.heads.deit_head.DeiTClsHead
+#: mmcls.models.heads.efficientformer_head.EfficientFormerClsHead
+#: mmcls.models.heads.efficientformer_head.EfficientFormerClsHead.loss
+#: mmcls.models.heads.linear_head.LinearClsHead mmcls.models.heads.margin_head.ArcFaceClsHead
+#: mmcls.models.heads.margin_head.ArcFaceClsHead.loss
+#: mmcls.models.heads.margin_head.ArcFaceClsHead.set_margins
+#: mmcls.models.heads.multi_label_cls_head.MultiLabelClsHead
+#: mmcls.models.heads.multi_label_cls_head.MultiLabelClsHead.loss
+#: mmcls.models.heads.multi_label_cls_head.MultiLabelClsHead.predict
+#: mmcls.models.heads.multi_label_csra_head.CSRAClsHead
+#: mmcls.models.heads.multi_label_linear_head.MultiLabelLinearClsHead
+#: mmcls.models.heads.stacked_head.StackedLinearClsHead
+#: mmcls.models.heads.vision_transformer_head.VisionTransformerClsHead
+#: mmcls.models.losses.asymmetric_loss.AsymmetricLoss
+#: mmcls.models.losses.asymmetric_loss.AsymmetricLoss.forward
+#: mmcls.models.losses.cross_entropy_loss.CrossEntropyLoss mmcls.models.losses.focal_loss.FocalLoss
+#: mmcls.models.losses.focal_loss.FocalLoss.forward mmcls.models.losses.label_smooth_loss.LabelSmoothLoss
+#: mmcls.models.losses.label_smooth_loss.LabelSmoothLoss.forward mmcls.models.losses.seesaw_loss.SeesawLoss
+#: mmcls.models.losses.seesaw_loss.SeesawLoss.forward mmcls.models.necks.gap.GlobalAveragePooling
+#: mmcls.models.necks.gem.GeneralizedMeanPooling mmcls.models.necks.hr_fuse.HRFuseScales
+#: mmcls.models.utils.attention.MultiheadAttention mmcls.models.utils.attention.ShiftWindowMSA
+#: mmcls.models.utils.attention.WindowMSA mmcls.models.utils.attention.WindowMSA.forward
+#: mmcls.models.utils.attention.WindowMSAV2 mmcls.models.utils.attention.WindowMSAV2.forward
+#: mmcls.models.utils.batch_augments.cutmix.CutMix
+#: mmcls.models.utils.batch_augments.cutmix.CutMix.cutmix_bbox_and_lam
+#: mmcls.models.utils.batch_augments.cutmix.CutMix.mix
+#: mmcls.models.utils.batch_augments.cutmix.CutMix.rand_bbox
+#: mmcls.models.utils.batch_augments.cutmix.CutMix.rand_bbox_minmax
+#: mmcls.models.utils.batch_augments.mixup.Mixup mmcls.models.utils.batch_augments.mixup.Mixup.mix
+#: mmcls.models.utils.batch_augments.resizemix.ResizeMix
+#: mmcls.models.utils.batch_augments.resizemix.ResizeMix.mix
+#: mmcls.models.utils.channel_shuffle.channel_shuffle mmcls.models.utils.data_preprocessor.ClsDataPreprocessor
+#: mmcls.models.utils.data_preprocessor.ClsDataPreprocessor.forward mmcls.models.utils.embed.HybridEmbed
+#: mmcls.models.utils.embed.PatchEmbed mmcls.models.utils.embed.PatchMerging
+#: mmcls.models.utils.embed.PatchMerging.forward mmcls.models.utils.embed.resize_pos_embed
+#: mmcls.models.utils.embed.resize_relative_position_bias_table mmcls.models.utils.helpers._ntuple
+#: mmcls.models.utils.inverted_residual.InvertedResidual
+#: mmcls.models.utils.inverted_residual.InvertedResidual.forward mmcls.models.utils.layer_scale.LayerScale
+#: mmcls.models.utils.make_divisible.make_divisible
+#: mmcls.models.utils.position_encoding.ConditionalPositionEncoding mmcls.models.utils.se_layer.SELayer
+#: mmcls.utils.setup_env.register_all_modules mmcls.visualization.cls_visualizer.ClsVisualizer of
+msgid "参数"
+msgstr ""
+
+#: mmcls.datasets.custom.CustomDataset:48 mmcls.datasets.imagenet.ImageNet:6
+#: mmcls.datasets.imagenet.ImageNet21k:7 of
+msgid "Annotation file path. Defaults to ''."
+msgstr ""
+
+#: mmcls.datasets.base_dataset.BaseDataset:14 mmcls.datasets.custom.CustomDataset:50
+#: mmcls.datasets.imagenet.ImageNet:8 mmcls.datasets.imagenet.ImageNet21k:9
+#: mmcls.datasets.multi_label.MultiLabelDataset:35 of
+msgid "Meta information for dataset, such as class information. Defaults to None."
+msgstr ""
+
+#: mmcls.datasets.base_dataset.BaseDataset:17 mmcls.datasets.custom.CustomDataset:53
+#: mmcls.datasets.imagenet.ImageNet:11 mmcls.datasets.imagenet.ImageNet21k:12
+#: mmcls.datasets.multi_label.MultiLabelDataset:38 of
+msgid "The root directory for ``data_prefix`` and ``ann_file``. Defaults to ''."
+msgstr ""
+
+#: mmcls.datasets.custom.CustomDataset:56 of
+msgid "Prefix for the data. Defaults to ''."
+msgstr ""
+
+#: mmcls.datasets.custom.CustomDataset:58 of
+msgid ""
+"A sequence of allowed extensions. Defaults to ('.jpg', '.jpeg', '.png', '.ppm', '.bmp', '.pgm', '.tif')."
+msgstr ""
+
+#: mmcls.datasets.base_dataset.BaseDataset:37 mmcls.datasets.custom.CustomDataset:61
+#: mmcls.datasets.multi_label.MultiLabelDataset:59 of
+msgid ""
+"Whether to load annotation during instantiation. In some cases, such as visualization, only the meta "
+"information of the dataset is needed, which is not necessary to load annotation file. ``Basedataset`` can "
+"skip load annotations to save time by set ``lazy_init=False``. Defaults to False."
+msgstr ""
+
+#: mmcls.datasets.cifar.CIFAR10:20 mmcls.datasets.cifar.CIFAR100:17 mmcls.datasets.custom.CustomDataset:67
+#: mmcls.datasets.mnist.FashionMNIST:18 mmcls.datasets.mnist.MNIST:20 mmcls.datasets.voc.VOC:40 of
+msgid "Other keyword arguments in :class:`BaseDataset`."
+msgstr ""
+
+#: ../../api/datasets.rst:22
+msgid "ImageNet"
+msgstr ""
+
+#: mmcls.datasets.imagenet.ImageNet:1 of
+msgid "`ImageNet `_ Dataset."
+msgstr ""
+
+#: mmcls.datasets.imagenet.ImageNet:3 of
+msgid ""
+"The dataset supports two kinds of annotation format. More details can be found in :class:`CustomDataset`."
+msgstr ""
+
+#: mmcls.datasets.base_dataset.BaseDataset:20 mmcls.datasets.imagenet.ImageNet:14
+#: mmcls.datasets.imagenet.ImageNet21k:15 mmcls.datasets.multi_label.MultiLabelDataset:41 of
+msgid "Prefix for training data. Defaults to ''."
+msgstr ""
+
+#: mmcls.datasets.imagenet.ImageNet:16 mmcls.datasets.imagenet.ImageNet21k:20 of
+msgid "Other keyword arguments in :class:`CustomDataset` and :class:`BaseDataset`."
+msgstr ""
+
+#: mmcls.datasets.imagenet.ImageNet21k:1 of
+msgid "ImageNet21k Dataset."
+msgstr ""
+
+#: mmcls.datasets.imagenet.ImageNet21k:3 of
+msgid ""
+"Since the dataset ImageNet21k is extremely big, cantains 21k+ classes and 1.4B files. We won't provide the "
+"default categories list. Please specify it from the ``classes`` argument."
+msgstr ""
+
+#: mmcls.datasets.imagenet.ImageNet21k:17 of
+msgid "Not implement by now. Use multi label or not. Defaults to False."
+msgstr ""
+
+#: ../../api/datasets.rst:29
+msgid "CIFAR"
+msgstr ""
+
+#: mmcls.datasets.cifar.CIFAR10:1 of
+msgid "`CIFAR10 `_ Dataset."
+msgstr ""
+
+#: mmcls.datasets.cifar.CIFAR10:3 of
+msgid ""
+"This implementation is modified from https://github.com/pytorch/vision/blob/master/torchvision/datasets/"
+"cifar.py"
+msgstr ""
+
+#: mmcls.datasets.cifar.CIFAR10:6 mmcls.datasets.cifar.CIFAR100:3 mmcls.datasets.mnist.FashionMNIST:4
+#: mmcls.datasets.mnist.MNIST:6 of
+msgid "Prefix for data."
+msgstr ""
+
+#: mmcls.datasets.cifar.CIFAR10:8 mmcls.datasets.cifar.CIFAR100:5 mmcls.datasets.cub.CUB:28
+#: mmcls.datasets.mnist.FashionMNIST:6 mmcls.datasets.mnist.MNIST:8 mmcls.datasets.voc.VOC:34 of
+msgid "``test_mode=True`` means in test phase. It determines to use the training set or test set."
+msgstr ""
+
+#: mmcls.datasets.cifar.CIFAR10:11 mmcls.datasets.cifar.CIFAR100:8 mmcls.datasets.mnist.FashionMNIST:9
+#: mmcls.datasets.mnist.MNIST:11 mmcls.datasets.voc.VOC:37 of
+msgid "Meta information for dataset, such as categories information. Defaults to None."
+msgstr ""
+
+#: mmcls.datasets.cifar.CIFAR10:14 mmcls.datasets.cifar.CIFAR100:11 mmcls.datasets.mnist.FashionMNIST:12
+#: mmcls.datasets.mnist.MNIST:14 of
+msgid "The root directory for ``data_prefix``. Defaults to ''."
+msgstr ""
+
+#: mmcls.datasets.cifar.CIFAR10:17 mmcls.datasets.cifar.CIFAR100:14 mmcls.datasets.mnist.FashionMNIST:15
+#: mmcls.datasets.mnist.MNIST:17 of
+msgid "Whether to download the dataset if not exists. Defaults to True."
+msgstr ""
+
+#: mmcls.datasets.cifar.CIFAR100:1 of
+msgid "`CIFAR100 `_ Dataset."
+msgstr ""
+
+#: ../../api/datasets.rst:36
+msgid "MNIST"
+msgstr ""
+
+#: mmcls.datasets.mnist.MNIST:1 of
+msgid "`MNIST `_ Dataset."
+msgstr ""
+
+#: mmcls.datasets.mnist.MNIST:3 of
+msgid ""
+"This implementation is modified from https://github.com/pytorch/vision/blob/master/torchvision/datasets/"
+"mnist.py"
+msgstr ""
+
+#: mmcls.datasets.mnist.FashionMNIST:1 of
+msgid "`Fashion-MNIST `_ Dataset."
+msgstr ""
+
+#: ../../api/datasets.rst:43
+msgid "VOC"
+msgstr ""
+
+#: mmcls.datasets.voc.VOC:1 of
+msgid "`Pascal VOC `_ Dataset."
+msgstr ""
+
+#: mmcls.datasets.voc.VOC:3 of
+msgid "After decompression, the dataset directory structure is as follows:"
+msgstr ""
+
+#: mmcls.datasets.voc.VOC:5 of
+msgid "VOC dataset directory: ::"
+msgstr ""
+
+#: mmcls.datasets.voc.VOC:18 of
+msgid ""
+"Extra difficult label is in VOC annotations, we will use `gt_label_difficult` to record the difficult "
+"labels in each sample and corresponding evaluation should take care of this field to calculate metrics. "
+"Usually, difficult labels are reckoned as negative in defaults."
+msgstr ""
+
+#: mmcls.datasets.voc.VOC:24 of
+msgid "The root directory for VOC dataset."
+msgstr ""
+
+#: mmcls.datasets.voc.VOC:26 of
+msgid ""
+"The path of image set, The file which lists image ids of the sub dataset, and this path is relative to "
+"``data_root``."
+msgstr ""
+
+#: mmcls.datasets.voc.VOC:30 of
+msgid ""
+"Prefix for data and annotation, keyword 'img_path' and 'ann_path' can be set. Defaults to be "
+"``dict(img_path='JPEGImages', ann_path='Annotations')``."
+msgstr ""
+
+#: ../../api/datasets.rst:48
+msgid "CUB"
+msgstr ""
+
+#: mmcls.datasets.cub.CUB:1 of
+msgid "The CUB-200-2011 Dataset."
+msgstr ""
+
+#: mmcls.datasets.cub.CUB:3 of
+msgid ""
+"Support the `CUB-200-2011 `_ Dataset. Comparing "
+"with the `CUB-200 `_ Dataset, there are much more "
+"pictures in `CUB-200-2011`. After downloading and decompression, the dataset directory structure is as "
+"follows."
+msgstr ""
+
+#: mmcls.datasets.cub.CUB:8 of
+msgid "CUB dataset directory: ::"
+msgstr ""
+
+#: mmcls.datasets.cub.CUB:26 of
+msgid "The root directory for CUB-200-2011 dataset."
+msgstr ""
+
+#: mmcls.datasets.cub.CUB:31 of
+msgid "Annotation file path, path relative to ``data_root``. Defaults to 'images.txt'."
+msgstr ""
+
+#: mmcls.datasets.cub.CUB:34 of
+msgid "Prefix for iamges, path relative to ``data_root``. Defaults to 'images'."
+msgstr ""
+
+#: mmcls.datasets.cub.CUB:37 of
+msgid "The label file, path relative to ``data_root``. Defaults to 'image_class_labels.txt'."
+msgstr ""
+
+#: mmcls.datasets.cub.CUB:40 of
+msgid ""
+"The split file to split train and test dataset, path relative to ``data_root``. Defaults to "
+"'train_test_split_file.txt'."
+msgstr ""
+
+#: mmcls.datasets.cub.CUB:46 mmcls.datasets.transforms.auto_augment.RandAugment:44
+#: mmcls.evaluation.metrics.multi_label.AveragePrecision:39
+#: mmcls.evaluation.metrics.multi_label.MultiLabelMetric:69 mmcls.evaluation.metrics.single_label.Accuracy:32
+#: mmcls.evaluation.metrics.single_label.SingleLabelMetric:68 mmcls.models.backbones.mvit.MViT:80
+#: mmcls.models.backbones.swin_transformer.SwinTransformer:75
+#: mmcls.models.backbones.swin_transformer_v2.SwinTransformerV2:78 mmcls.models.backbones.twins.PCPVT:46
+#: mmcls.models.backbones.twins.SVT:47 mmcls.models.backbones.van.VAN:50
+#: mmcls.models.classifiers.hugging_face.HuggingFaceClassifier:49
+#: mmcls.models.classifiers.image.ImageClassifier.extract_feat:25
+#: mmcls.models.classifiers.timm.TimmClassifier:40 mmcls.structures.cls_data_sample.ClsDataSample:21
+#: mmcls.visualization.cls_visualizer.ClsVisualizer:22 of
+msgid "实际案例"
+msgstr "使用示例"
+
+#: ../../api/datasets.rst:53
+msgid "Base classes"
+msgstr ""
+
+#: mmcls.datasets.base_dataset.BaseDataset:1 of
+msgid "Base dataset for image classification task."
+msgstr ""
+
+#: mmcls.datasets.base_dataset.BaseDataset:3 mmcls.datasets.multi_label.MultiLabelDataset:3 of
+msgid "This dataset support annotation file in `OpenMMLab 2.0 style annotation format`."
+msgstr ""
+
+#: mmcls.datasets.base_dataset.BaseDataset:9 of
+msgid "Comparing with the :class:`mmengine.BaseDataset`, this class implemented several useful methods."
+msgstr ""
+
+#: mmcls.datasets.base_dataset.BaseDataset:12 mmcls.datasets.multi_label.MultiLabelDataset:33 of
+msgid "Annotation file path."
+msgstr ""
+
+#: mmcls.datasets.base_dataset.BaseDataset:22 mmcls.datasets.multi_label.MultiLabelDataset:43 of
+msgid "Config for filter data. Defaults to None."
+msgstr ""
+
+#: mmcls.datasets.base_dataset.BaseDataset:24 of
+msgid ""
+"Support using first few data in annotation file to facilitate training/testing on a smaller dataset. "
+"Defaults to None, which means using all ``data_infos``."
+msgstr ""
+
+#: mmcls.datasets.base_dataset.BaseDataset:28 mmcls.datasets.multi_label.MultiLabelDataset:49 of
+msgid ""
+"Whether to hold memory using serialized objects, when enabled, data loader workers can use shared RAM from "
+"master process instead of making a copy. Defaults to True."
+msgstr ""
+
+#: mmcls.datasets.base_dataset.BaseDataset:32 of
+msgid "Processing pipeline. Defaults to an empty tuple."
+msgstr ""
+
+#: mmcls.datasets.base_dataset.BaseDataset:34 mmcls.datasets.multi_label.MultiLabelDataset:56 of
+msgid "``test_mode=True`` means in test phase. Defaults to False."
+msgstr ""
+
+#: mmcls.datasets.base_dataset.BaseDataset:43 mmcls.datasets.multi_label.MultiLabelDataset:65 of
+msgid ""
+"If ``Basedataset.prepare_data`` get a None img. The maximum extra number of cycles to get a valid image. "
+"Defaults to 1000."
+msgstr ""
+
+#: mmcls.datasets.base_dataset.BaseDataset:47 mmcls.datasets.multi_label.MultiLabelDataset:69 of
+msgid ""
+"Specify names of classes. - If is string, it should be a file path, and the every line of the file is a "
+"name of a class. - If is a sequence of string, every item is a name of class. - If is None, use categories "
+"information in ``metainfo`` argument, annotation file or the class attribute ``METAINFO``. Defaults to "
+"None."
+msgstr ""
+
+#: mmcls.datasets.base_dataset.BaseDataset:47 mmcls.datasets.multi_label.MultiLabelDataset:69 of
+msgid "Specify names of classes."
+msgstr ""
+
+#: mmcls.datasets.base_dataset.BaseDataset:49 mmcls.datasets.multi_label.MultiLabelDataset:71 of
+msgid "If is string, it should be a file path, and the every line of the file is a name of a class."
+msgstr ""
+
+#: mmcls.datasets.base_dataset.BaseDataset:51 mmcls.datasets.multi_label.MultiLabelDataset:73 of
+msgid "If is a sequence of string, every item is a name of class."
+msgstr ""
+
+#: mmcls.datasets.base_dataset.BaseDataset:52 mmcls.datasets.multi_label.MultiLabelDataset:74 of
+msgid ""
+"If is None, use categories information in ``metainfo`` argument, annotation file or the class attribute "
+"``METAINFO``."
+msgstr ""
+
+#: mmcls.datasets.base_dataset.BaseDataset:55 mmcls.datasets.multi_label.MultiLabelDataset:77
+#: mmcls.models.backbones.hrnet.HRNet:23 mmcls.models.classifiers.hugging_face.HuggingFaceClassifier:32
+#: mmcls.models.classifiers.image.ImageClassifier:23 mmcls.models.classifiers.timm.TimmClassifier:23 of
+msgid "Defaults to None."
+msgstr ""
+
+#: mmcls.datasets.multi_label.MultiLabelDataset:1 of
+msgid "Multi-label Dataset."
+msgstr ""
+
+#: mmcls.datasets.multi_label.MultiLabelDataset:9 of
+msgid "The annotation format is shown as follows."
+msgstr ""
+
+#: mmcls.datasets.multi_label.MultiLabelDataset:45 of
+msgid ""
+"Support using first few data in annotation file to facilitate training/testing on a smaller dataset. "
+"Defaults to None which means using all ``data_infos``."
+msgstr ""
+
+#: mmcls.datasets.multi_label.MultiLabelDataset:54 of
+msgid "Processing pipeline. Defaults to []."
+msgstr ""
+
+#: ../../api/datasets.rst:60
+msgid "Dataset Wrappers"
+msgstr ""
+
+#: mmcls.datasets.dataset_wrappers.KFoldDataset:1 of
+msgid "A wrapper of dataset for K-Fold cross-validation."
+msgstr ""
+
+#: mmcls.datasets.dataset_wrappers.KFoldDataset:3 of
+msgid ""
+"K-Fold cross-validation divides all the samples in groups of samples, called folds, of almost equal sizes. "
+"And we use k-1 of folds to do training and use the fold left to do validation."
+msgstr ""
+
+#: mmcls.datasets.dataset_wrappers.KFoldDataset:7 of
+msgid "The dataset to be divided"
+msgstr ""
+
+#: mmcls.datasets.dataset_wrappers.KFoldDataset:10 of
+msgid "The fold used to do validation. Defaults to 0."
+msgstr ""
+
+#: mmcls.datasets.dataset_wrappers.KFoldDataset:12 of
+msgid "The number of all folds. Defaults to 5."
+msgstr ""
+
+#: mmcls.datasets.dataset_wrappers.KFoldDataset:14 of
+msgid "Use the training dataset or validation dataset. Defaults to False."
+msgstr ""
+
+#: mmcls.datasets.dataset_wrappers.KFoldDataset:17 of
+msgid "The seed to shuffle the dataset before splitting. If None, not shuffle the dataset. Defaults to None."
+msgstr ""
+
+#: ../../api/datasets.rst:64
+msgid "The dataset wrappers in the MMEngine can be directly used in MMClassification."
+msgstr ""
+
+#: ../../api/datasets.rst:68
+msgid ":class:`~mmengine.dataset.ConcatDataset`"
+msgstr ""
+
+#: ../../api/datasets.rst:69
+msgid "A wrapper of concatenated dataset."
+msgstr ""
+
+#: ../../api/datasets.rst:70
+msgid ":class:`~mmengine.dataset.RepeatDataset`"
+msgstr ""
+
+#: ../../api/datasets.rst:71
+msgid "A wrapper of repeated dataset."
+msgstr ""
+
+#: ../../api/datasets.rst:72
+msgid ":class:`~mmengine.dataset.ClassBalancedDataset`"
+msgstr ""
+
+#: ../../api/datasets.rst:73
+msgid "A wrapper of class balanced dataset."
+msgstr ""
+
+#: ../../api/engine.rst:7 ../../api/engine.rst:19
+msgid "mmcls.engine"
+msgstr ""
+
+#: ../../api/engine.rst:9
+msgid ""
+"This package includes some runtime components, including hooks, runners, optimizers and loops. These "
+"components are useful in classification tasks but not supported by MMEngine yet."
+msgstr ""
+"该包中包含了一些运行时组件,如钩子(hook)、执行器(runner)、优化器(optimizer)和循环执行器(loop)。这些"
+"组件在分类任务中需要用到,而还未被 MMEngine 支持。"
+
+#: ../../api/engine.rst:14
+msgid "Some components may be moved to MMEngine in the future."
+msgstr "部分组件未来可能会被移动到 MMEngine 中。"
+
+#: ../../api/engine.rst:24
+msgid "Hooks"
+msgstr ""
+
+#: ../../api/engine.rst:36::1
+msgid ":py:obj:`ClassNumCheckHook `"
+msgstr ""
+
+#: ../../api/engine.rst:36::1 mmcls.engine.hooks.class_num_check_hook.ClassNumCheckHook:1 of
+msgid "Class Number Check HOOK."
+msgstr ""
+
+#: ../../api/engine.rst:36::1
+msgid ":py:obj:`PreciseBNHook `"
+msgstr ""
+
+#: ../../api/engine.rst:36::1 mmcls.engine.hooks.precise_bn_hook.PreciseBNHook:1 of
+msgid "Precise BN hook."
+msgstr ""
+
+#: ../../api/engine.rst:36::1
+msgid ":py:obj:`VisualizationHook `"
+msgstr ""
+
+#: ../../api/engine.rst:36::1
+msgid "Classification Visualization Hook."
+msgstr ""
+
+#: ../../api/engine.rst:36::1
+msgid ":py:obj:`PrepareProtoBeforeValLoopHook `"
+msgstr ""
+
+#: ../../api/engine.rst:36::1 mmcls.engine.hooks.retriever_hooks.PrepareProtoBeforeValLoopHook:1
+#: of
+msgid "The hook to prepare the prototype in retrievers."
+msgstr ""
+
+#: ../../api/engine.rst:36::1
+msgid ":py:obj:`SetAdaptiveMarginsHook `"
+msgstr ""
+
+#: ../../api/engine.rst:36::1 mmcls.engine.hooks.margin_head_hooks.SetAdaptiveMarginsHook:1 of
+msgid "Set adaptive-margins in ArcFaceClsHead based on the power of category-wise count."
+msgstr ""
+
+#: ../../api/engine.rst:40
+msgid "Optimizers"
+msgstr ""
+
+#: ../../api/engine.rst:47::1
+msgid ":py:obj:`Lamb `"
+msgstr ""
+
+#: ../../api/engine.rst:47::1 mmcls.engine.optimizers.lamb.Lamb:1 of
+msgid "A pure pytorch variant of FuseLAMB (NvLamb variant) optimizer."
+msgstr ""
+
+#: ../../api/evaluation.rst:7 ../../api/evaluation.rst:14
+msgid "mmcls.evaluation"
+msgstr ""
+
+#: ../../api/evaluation.rst:9
+msgid "This package includes metrics and evaluators for classification tasks."
+msgstr "该包中包含了用于分类任务的一系列评测指标及评测器。"
+
+#: ../../api/evaluation.rst:17
+msgid "Single Label Metric"
+msgstr ""
+
+#: ../../api/evaluation.rst:26::1
+msgid ":py:obj:`Accuracy `"
+msgstr ""
+
+#: ../../api/evaluation.rst:26::1 mmcls.evaluation.metrics.single_label.Accuracy:1 of
+msgid "Accuracy evaluation metric."
+msgstr ""
+
+#: ../../api/evaluation.rst:26::1
+msgid ":py:obj:`SingleLabelMetric `"
+msgstr ""
+
+#: ../../api/evaluation.rst:26::1 mmcls.evaluation.metrics.single_label.SingleLabelMetric:1 of
+msgid "A collection of precision, recall, f1-score and support for single-label tasks."
+msgstr ""
+
+#: ../../api/evaluation.rst:28
+msgid "Multi Label Metric"
+msgstr ""
+
+#: ../../api/evaluation.rst:36::1
+msgid ":py:obj:`AveragePrecision `"
+msgstr ""
+
+#: ../../api/evaluation.rst:36::1 mmcls.evaluation.metrics.multi_label.AveragePrecision:1 of
+msgid "Calculate the average precision with respect of classes."
+msgstr ""
+
+#: ../../api/evaluation.rst:36::1
+msgid ":py:obj:`MultiLabelMetric `"
+msgstr ""
+
+#: ../../api/evaluation.rst:36::1 mmcls.evaluation.metrics.multi_label.MultiLabelMetric:1 of
+msgid "A collection of precision, recall, f1-score and support for multi-label tasks."
+msgstr ""
+
+#: ../../api/evaluation.rst:36::1
+msgid ":py:obj:`VOCAveragePrecision `"
+msgstr ""
+
+#: ../../api/evaluation.rst:36::1 mmcls.evaluation.metrics.voc_multi_label.VOCAveragePrecision:1
+#: of
+msgid "Calculate the average precision with respect of classes for VOC dataset."
+msgstr ""
+
+#: ../../api/evaluation.rst:36::1
+msgid ":py:obj:`VOCMultiLabelMetric `"
+msgstr ""
+
+#: ../../api/evaluation.rst:36::1 mmcls.evaluation.metrics.voc_multi_label.VOCMultiLabelMetric:1
+#: of
+msgid ""
+"A collection of metrics for multi-label multi-class classification task based on confusion matrix for VOC "
+"dataset."
+msgstr ""
+
+#: ../../api/generated/mmcls.apis.inference_model.rst:2
+msgid "mmcls.apis.inference\\_model"
+msgstr ""
+
+#: mmcls.apis.inference.inference_model:3 of
+msgid "The loaded classifier."
+msgstr ""
+
+#: mmcls.apis.inference.inference_model:5 of
+msgid "The image filename or loaded image."
+msgstr ""
+
+#: mmcls.apis.inference.inference_model mmcls.apis.inference.init_model
+#: mmcls.datasets.transforms.processing.Albumentations.albu_builder
+#: mmcls.datasets.transforms.processing.Albumentations.mapper
+#: mmcls.datasets.transforms.processing.Albumentations.transform
+#: mmcls.datasets.transforms.processing.ColorJitter.transform
+#: mmcls.datasets.transforms.processing.EfficientNetCenterCrop.transform
+#: mmcls.datasets.transforms.processing.Lighting.transform
+#: mmcls.datasets.transforms.processing.RandomCrop.transform
+#: mmcls.datasets.transforms.processing.RandomErasing.transform
+#: mmcls.datasets.transforms.processing.RandomResizedCrop.transform
+#: mmcls.datasets.transforms.processing.ResizeEdge.transform
+#: mmcls.evaluation.metrics.single_label.Accuracy.calculate
+#: mmcls.evaluation.metrics.single_label.Accuracy.compute_metrics
+#: mmcls.evaluation.metrics.single_label.SingleLabelMetric.calculate
+#: mmcls.evaluation.metrics.single_label.SingleLabelMetric.compute_metrics
+#: mmcls.models.backbones.regnet.RegNet.adjust_width_group
+#: mmcls.models.backbones.regnet.RegNet.generate_regnet
+#: mmcls.models.backbones.regnet.RegNet.get_stages_from_blocks
+#: mmcls.models.backbones.regnet.RegNet.quantize_float
+#: mmcls.models.classifiers.base.BaseClassifier.extract_feats
+#: mmcls.models.classifiers.base.BaseClassifier.forward
+#: mmcls.models.classifiers.hugging_face.HuggingFaceClassifier.loss
+#: mmcls.models.classifiers.hugging_face.HuggingFaceClassifier.predict
+#: mmcls.models.classifiers.image.ImageClassifier.extract_feat
+#: mmcls.models.classifiers.image.ImageClassifier.forward mmcls.models.classifiers.image.ImageClassifier.loss
+#: mmcls.models.classifiers.timm.TimmClassifier.loss mmcls.models.classifiers.timm.TimmClassifier.predict
+#: mmcls.models.heads.cls_head.ClsHead.loss mmcls.models.heads.cls_head.ClsHead.predict
+#: mmcls.models.heads.conformer_head.ConformerHead.predict
+#: mmcls.models.heads.efficientformer_head.EfficientFormerClsHead.loss
+#: mmcls.models.heads.margin_head.ArcFaceClsHead.loss
+#: mmcls.models.heads.multi_label_cls_head.MultiLabelClsHead.loss
+#: mmcls.models.heads.multi_label_cls_head.MultiLabelClsHead.predict
+#: mmcls.models.losses.asymmetric_loss.AsymmetricLoss.forward mmcls.models.losses.focal_loss.FocalLoss.forward
+#: mmcls.models.losses.label_smooth_loss.LabelSmoothLoss.forward
+#: mmcls.models.losses.seesaw_loss.SeesawLoss.forward mmcls.models.utils.batch_augments.cutmix.CutMix.mix
+#: mmcls.models.utils.batch_augments.mixup.Mixup.mix mmcls.models.utils.batch_augments.resizemix.ResizeMix.mix
+#: mmcls.models.utils.channel_shuffle.channel_shuffle
+#: mmcls.models.utils.data_preprocessor.ClsDataPreprocessor.forward
+#: mmcls.models.utils.embed.PatchMerging.forward mmcls.models.utils.embed.resize_pos_embed
+#: mmcls.models.utils.embed.resize_relative_position_bias_table
+#: mmcls.models.utils.inverted_residual.InvertedResidual.forward
+#: mmcls.models.utils.make_divisible.make_divisible of
+msgid "返回"
+msgstr ""
+
+#: mmcls.apis.inference.inference_model:8 of
+msgid "The classification results that contains `class_name`, `pred_label` and `pred_score`."
+msgstr ""
+
+#: mmcls.apis.inference.inference_model:10 of
+msgid "The classification results that contains"
+msgstr ""
+
+#: mmcls.apis.inference.inference_model:11 of
+msgid "`class_name`, `pred_label` and `pred_score`."
+msgstr ""
+
+#: mmcls.apis.inference.inference_model mmcls.apis.inference.init_model
+#: mmcls.datasets.transforms.processing.Albumentations.albu_builder
+#: mmcls.datasets.transforms.processing.Albumentations.mapper
+#: mmcls.datasets.transforms.processing.Albumentations.transform
+#: mmcls.datasets.transforms.processing.ColorJitter.transform
+#: mmcls.datasets.transforms.processing.EfficientNetCenterCrop.transform
+#: mmcls.datasets.transforms.processing.Lighting.transform
+#: mmcls.datasets.transforms.processing.RandomCrop.transform
+#: mmcls.datasets.transforms.processing.RandomErasing.transform
+#: mmcls.datasets.transforms.processing.RandomResizedCrop.transform
+#: mmcls.datasets.transforms.processing.ResizeEdge.transform
+#: mmcls.evaluation.metrics.single_label.Accuracy.calculate
+#: mmcls.evaluation.metrics.single_label.Accuracy.compute_metrics
+#: mmcls.evaluation.metrics.single_label.SingleLabelMetric.calculate
+#: mmcls.evaluation.metrics.single_label.SingleLabelMetric.compute_metrics
+#: mmcls.models.backbones.regnet.RegNet.adjust_width_group
+#: mmcls.models.backbones.regnet.RegNet.generate_regnet
+#: mmcls.models.backbones.regnet.RegNet.get_stages_from_blocks
+#: mmcls.models.backbones.regnet.RegNet.quantize_float
+#: mmcls.models.classifiers.base.BaseClassifier.extract_feats
+#: mmcls.models.classifiers.hugging_face.HuggingFaceClassifier.loss
+#: mmcls.models.classifiers.hugging_face.HuggingFaceClassifier.predict
+#: mmcls.models.classifiers.image.ImageClassifier.extract_feat
+#: mmcls.models.classifiers.image.ImageClassifier.loss mmcls.models.classifiers.timm.TimmClassifier.loss
+#: mmcls.models.classifiers.timm.TimmClassifier.predict mmcls.models.heads.cls_head.ClsHead.loss
+#: mmcls.models.heads.cls_head.ClsHead.predict mmcls.models.heads.conformer_head.ConformerHead.predict
+#: mmcls.models.heads.efficientformer_head.EfficientFormerClsHead.loss
+#: mmcls.models.heads.margin_head.ArcFaceClsHead.loss
+#: mmcls.models.heads.multi_label_cls_head.MultiLabelClsHead.loss
+#: mmcls.models.heads.multi_label_cls_head.MultiLabelClsHead.predict
+#: mmcls.models.losses.asymmetric_loss.AsymmetricLoss.forward mmcls.models.losses.focal_loss.FocalLoss.forward
+#: mmcls.models.losses.label_smooth_loss.LabelSmoothLoss.forward
+#: mmcls.models.losses.seesaw_loss.SeesawLoss.forward mmcls.models.utils.batch_augments.cutmix.CutMix.mix
+#: mmcls.models.utils.batch_augments.mixup.Mixup.mix mmcls.models.utils.batch_augments.resizemix.ResizeMix.mix
+#: mmcls.models.utils.channel_shuffle.channel_shuffle
+#: mmcls.models.utils.data_preprocessor.ClsDataPreprocessor.forward
+#: mmcls.models.utils.embed.PatchMerging.forward mmcls.models.utils.embed.resize_pos_embed
+#: mmcls.models.utils.embed.resize_relative_position_bias_table
+#: mmcls.models.utils.inverted_residual.InvertedResidual.forward
+#: mmcls.models.utils.make_divisible.make_divisible of
+msgid "返回类型"
+msgstr ""
+
+#: ../../api/generated/mmcls.apis.init_model.rst:2
+msgid "mmcls.apis.init\\_model"
+msgstr ""
+
+#: mmcls.apis.inference.init_model:3 of
+msgid "Config file path or the config object."
+msgstr ""
+
+#: mmcls.apis.inference.init_model:6 of
+msgid "Checkpoint path. If left as None, the model will not load any weights."
+msgstr ""
+
+#: mmcls.apis.inference.init_model:9 of
+msgid "Options to override some settings in the used config."
+msgstr ""
+
+#: mmcls.apis.inference.init_model:12 of
+msgid "The constructed classifier."
+msgstr ""
+
+#: ../../api/generated/mmcls.datasets.transforms.Albumentations.rst:7
+msgid "Albumentations"
+msgstr ""
+
+#: mmcls.datasets.transforms.formatting.Collect:3 mmcls.datasets.transforms.formatting.PackClsInputs:3
+#: mmcls.datasets.transforms.formatting.ToNumpy:3 mmcls.datasets.transforms.formatting.ToPIL:3
+#: mmcls.datasets.transforms.formatting.Transpose:3 mmcls.datasets.transforms.processing.Albumentations:3
+#: mmcls.datasets.transforms.processing.ColorJitter:7
+#: mmcls.datasets.transforms.processing.EfficientNetCenterCrop:3
+#: mmcls.datasets.transforms.processing.EfficientNetRandomCrop:3
+#: mmcls.datasets.transforms.processing.Lighting:3 mmcls.datasets.transforms.processing.RandomCrop:3
+#: mmcls.datasets.transforms.processing.RandomErasing:3
+#: mmcls.datasets.transforms.processing.RandomResizedCrop:7 mmcls.datasets.transforms.processing.ResizeEdge:3
+#: of
+msgid "**Required Keys:**"
+msgstr ""
+
+#: mmcls.datasets.transforms.formatting.PackClsInputs:5 mmcls.datasets.transforms.formatting.ToPIL:5
+#: mmcls.datasets.transforms.formatting.ToPIL:9 mmcls.datasets.transforms.processing.Albumentations:5
+#: mmcls.datasets.transforms.processing.Albumentations:9 mmcls.datasets.transforms.processing.ColorJitter:9
+#: mmcls.datasets.transforms.processing.ColorJitter:13
+#: mmcls.datasets.transforms.processing.EfficientNetCenterCrop:5
+#: mmcls.datasets.transforms.processing.EfficientNetCenterCrop:9
+#: mmcls.datasets.transforms.processing.EfficientNetRandomCrop:5
+#: mmcls.datasets.transforms.processing.EfficientNetRandomCrop:9
+#: mmcls.datasets.transforms.processing.Lighting:5 mmcls.datasets.transforms.processing.Lighting:9
+#: mmcls.datasets.transforms.processing.RandomCrop:5 mmcls.datasets.transforms.processing.RandomCrop:9
+#: mmcls.datasets.transforms.processing.RandomErasing:5 mmcls.datasets.transforms.processing.RandomErasing:9
+#: mmcls.datasets.transforms.processing.RandomResizedCrop:9
+#: mmcls.datasets.transforms.processing.RandomResizedCrop:13 mmcls.datasets.transforms.processing.ResizeEdge:5
+#: mmcls.datasets.transforms.processing.ResizeEdge:9 of
+msgid "img"
+msgstr ""
+
+#: mmcls.datasets.transforms.formatting.ToNumpy:7 mmcls.datasets.transforms.formatting.ToPIL:7
+#: mmcls.datasets.transforms.formatting.Transpose:7 mmcls.datasets.transforms.processing.Albumentations:7
+#: mmcls.datasets.transforms.processing.ColorJitter:11
+#: mmcls.datasets.transforms.processing.EfficientNetCenterCrop:7
+#: mmcls.datasets.transforms.processing.EfficientNetRandomCrop:7
+#: mmcls.datasets.transforms.processing.Lighting:7 mmcls.datasets.transforms.processing.RandomCrop:7
+#: mmcls.datasets.transforms.processing.RandomErasing:7
+#: mmcls.datasets.transforms.processing.RandomResizedCrop:11 mmcls.datasets.transforms.processing.ResizeEdge:7
+#: of
+msgid "**Modified Keys:**"
+msgstr ""
+
+#: mmcls.datasets.transforms.processing.Albumentations:10
+#: mmcls.datasets.transforms.processing.EfficientNetCenterCrop:10
+#: mmcls.datasets.transforms.processing.EfficientNetRandomCrop:10
+#: mmcls.datasets.transforms.processing.RandomCrop:10
+#: mmcls.datasets.transforms.processing.RandomResizedCrop:14
+#: mmcls.datasets.transforms.processing.ResizeEdge:10 of
+msgid "img_shape"
+msgstr ""
+
+#: mmcls.datasets.transforms.processing.Albumentations:12 of
+msgid ""
+"Adds custom transformations from albumentations library. More details can be found in `Albumentations "
+"`_. An example of ``transforms`` is as followed:"
+msgstr ""
+
+#: mmcls.datasets.transforms.processing.Albumentations:42 of
+msgid "List of albumentations transform configs."
+msgstr ""
+
+#: mmcls.datasets.transforms.processing.Albumentations:44 of
+msgid ""
+"Mapping of mmcls to albumentations fields, in format {'input key':'albumentation-style key'}. Defaults to "
+"None."
+msgstr ""
+
+#: mmcls.datasets.transforms.processing.Albumentations:50 mmcls.models.backbones.cspnet.CSPDarkNet:30
+#: mmcls.models.backbones.cspnet.CSPNet:63 mmcls.models.backbones.cspnet.CSPResNeXt:28
+#: mmcls.models.backbones.cspnet.CSPResNet:28 mmcls.models.backbones.efficientformer.EfficientFormer:53
+#: mmcls.models.backbones.hrnet.HRNet:52 mmcls.models.backbones.inception_v3.InceptionV3:23
+#: mmcls.models.backbones.mobileone.MobileOne:48 mmcls.models.backbones.regnet.RegNet:45
+#: mmcls.models.backbones.res2net.Res2Net:56 mmcls.models.backbones.resnet.ResNet:54
+#: mmcls.models.backbones.seresnet.SEResNet:56 mmcls.models.heads.margin_head.ArcFaceClsHead:9 of
+msgid "示例"
+msgstr ""
+
+#: mmcls.datasets.transforms.processing.Albumentations.albu_builder:1 of
+msgid "Import a module from albumentations."
+msgstr ""
+
+#: mmcls.datasets.transforms.processing.Albumentations.albu_builder:3 of
+msgid ""
+"It inherits some of :func:`build_from_cfg` logic. :param cfg: Config dict. It should at least contain the "
+"key \"type\". :type cfg: dict"
+msgstr ""
+
+#: mmcls.datasets.transforms.processing.Albumentations.albu_builder:7 of
+msgid "The constructed object."
+msgstr ""
+
+#: mmcls.datasets.transforms.processing.Albumentations.mapper:1 of
+msgid "Dictionary mapper."
+msgstr ""
+
+#: mmcls.datasets.transforms.processing.Albumentations.mapper:3 of
+msgid ""
+"Renames keys according to keymap provided. :param d: old dict :type d: dict :param keymap: "
+"{'old_key':'new_key'} :type keymap: dict"
+msgstr ""
+
+#: mmcls.datasets.transforms.processing.Albumentations.mapper:9 of
+msgid "new dict."
+msgstr ""
+
+#: mmcls.datasets.transforms.processing.Albumentations.transform:1 of
+msgid "Transform function to perform albumentations transforms."
+msgstr ""
+
+#: mmcls.datasets.transforms.processing.Albumentations.transform:3
+#: mmcls.datasets.transforms.processing.ColorJitter.transform:3
+#: mmcls.datasets.transforms.processing.EfficientNetCenterCrop.transform:3
+#: mmcls.datasets.transforms.processing.Lighting.transform:3
+#: mmcls.datasets.transforms.processing.RandomCrop.transform:3
+#: mmcls.datasets.transforms.processing.RandomResizedCrop.transform:3
+#: mmcls.datasets.transforms.processing.ResizeEdge.transform:3 of
+msgid "Result dict from loading pipeline."
+msgstr ""
+
+#: mmcls.datasets.transforms.processing.Albumentations.transform:6 of
+msgid "Transformed results, 'img' and 'img_shape' keys are updated in result dict."
+msgstr ""
+
+#: mmcls.datasets.transforms.processing.Albumentations.transform:8 of
+msgid "Transformed results, 'img' and 'img_shape' keys are"
+msgstr ""
+
+#: mmcls.datasets.transforms.processing.Albumentations.transform:9 of
+msgid "updated in result dict."
+msgstr ""
+
+#: ../../api/generated/mmcls.datasets.transforms.AutoAugment.rst:7
+msgid "AutoAugment"
+msgstr ""
+
+#: mmcls.datasets.transforms.auto_augment.AutoAugment:3 of
+msgid ""
+"This data augmentation is proposed in `AutoAugment: Learning Augmentation Policies from Data `_."
+msgstr ""
+
+#: mmcls.datasets.transforms.auto_augment.AutoAugment:6 of
+msgid ""
+"The policies of auto augmentation. If string, use preset policies collection like \"imagenet\". If list, "
+"Each item is a sub policies, composed by several augmentation policy dicts. When AutoAugment is called, a "
+"random sub policies in ``policies`` will be selected to augment images."
+msgstr ""
+
+#: mmcls.datasets.transforms.auto_augment.AutoAugment:12 mmcls.datasets.transforms.auto_augment.RandAugment:38
+#: of
+msgid ""
+"Configs of hyperparameters. Hyperparameters will be used in policies that require these arguments if these "
+"arguments are not set in policy dicts. Defaults to ``dict(pad_val=128)``."
+msgstr ""
+
+#: ../../api/generated/mmcls.datasets.transforms.AutoContrast.rst:7
+msgid "AutoContrast"
+msgstr ""
+
+#: mmcls.datasets.transforms.auto_augment.AutoContrast:3 of
+msgid "The probability for performing auto contrast therefore should be in range [0, 1]. Defaults to 0.5."
+msgstr ""
+
+#: mmcls.datasets.transforms.auto_augment.AutoContrast:6 mmcls.datasets.transforms.auto_augment.Brightness:15
+#: mmcls.datasets.transforms.auto_augment.ColorTransform:15 mmcls.datasets.transforms.auto_augment.Cutout:15
+#: mmcls.datasets.transforms.auto_augment.Equalize:6 mmcls.datasets.transforms.auto_augment.Invert:6
+#: mmcls.datasets.transforms.auto_augment.Posterize:11 mmcls.datasets.transforms.auto_augment.Rotate:27
+#: mmcls.datasets.transforms.auto_augment.Sharpness:15 mmcls.datasets.transforms.auto_augment.Shear:23
+#: mmcls.datasets.transforms.auto_augment.Solarize:10 mmcls.datasets.transforms.auto_augment.SolarizeAdd:13
+#: mmcls.datasets.transforms.auto_augment.Translate:25 of
+msgid "Other keyword arguments of :class:`BaseAugTransform`."
+msgstr ""
+
+#: mmcls.datasets.transforms.auto_augment.AutoContrast.transform:1
+#: mmcls.datasets.transforms.auto_augment.Brightness.transform:1
+#: mmcls.datasets.transforms.auto_augment.ColorTransform.transform:1
+#: mmcls.datasets.transforms.auto_augment.Contrast.transform:1
+#: mmcls.datasets.transforms.auto_augment.Cutout.transform:1
+#: mmcls.datasets.transforms.auto_augment.Equalize.transform:1
+#: mmcls.datasets.transforms.auto_augment.Invert.transform:1
+#: mmcls.datasets.transforms.auto_augment.Posterize.transform:1
+#: mmcls.datasets.transforms.auto_augment.Rotate.transform:1
+#: mmcls.datasets.transforms.auto_augment.Sharpness.transform:1
+#: mmcls.datasets.transforms.auto_augment.Shear.transform:1
+#: mmcls.datasets.transforms.auto_augment.Solarize.transform:1
+#: mmcls.datasets.transforms.auto_augment.SolarizeAdd.transform:1
+#: mmcls.datasets.transforms.auto_augment.Translate.transform:1 of
+msgid "Apply transform to results."
+msgstr ""
+
+#: ../../api/generated/mmcls.datasets.transforms.BaseAugTransform.rst:7
+msgid "BaseAugTransform"
+msgstr ""
+
+#: mmcls.datasets.transforms.auto_augment.BaseAugTransform:3 of
+msgid ""
+"This class provides several common attributions and methods to support the magnitude level mapping and "
+"magnitude level randomness in :class:`RandAugment`."
+msgstr ""
+
+#: mmcls.datasets.transforms.auto_augment.BaseAugTransform:7 of
+msgid "Magnitude level."
+msgstr ""
+
+#: mmcls.datasets.transforms.auto_augment.BaseAugTransform:9 of
+msgid ""
+"For augmentation have magnitude argument, maybe \"magnitude\", \"angle\" or other, you can specify the "
+"magnitude level mapping range to generate the magnitude argument. For example, assume ``total_level`` is "
+"10, ``magnitude_level=3`` specify magnitude is 3 if ``magnitude_range=(0, 10)`` while specify magnitude is "
+"7 if ``magnitude_range=(10, 0)``. Defaults to None."
+msgstr ""
+
+#: mmcls.datasets.transforms.auto_augment.BaseAugTransform:17 of
+msgid ""
+"Deviation of magnitude noise applied. - If positive number, the magnitude obeys normal distribution :"
+"math:`\\mathcal{N}(magnitude, magnitude_std)`. - If 0 or negative number, magnitude remains unchanged. - If "
+"str \"inf\", the magnitude obeys uniform distribution :math:`Uniform(min, magnitude)`. Defaults to 0."
+msgstr ""
+
+#: mmcls.datasets.transforms.auto_augment.BaseAugTransform:17
+#: mmcls.datasets.transforms.auto_augment.RandAugment:27 of
+msgid "Deviation of magnitude noise applied."
+msgstr ""
+
+#: mmcls.datasets.transforms.auto_augment.BaseAugTransform:19 of
+msgid ""
+"If positive number, the magnitude obeys normal distribution :math:`\\mathcal{N}(magnitude, magnitude_std)`."
+msgstr ""
+
+#: mmcls.datasets.transforms.auto_augment.BaseAugTransform:21
+#: mmcls.datasets.transforms.auto_augment.RandAugment:31 of
+msgid "If 0 or negative number, magnitude remains unchanged."
+msgstr ""
+
+#: mmcls.datasets.transforms.auto_augment.BaseAugTransform:22
+#: mmcls.datasets.transforms.auto_augment.RandAugment:32 of
+msgid "If str \"inf\", the magnitude obeys uniform distribution :math:`Uniform(min, magnitude)`."
+msgstr ""
+
+#: mmcls.datasets.transforms.auto_augment.BaseAugTransform:25 of
+msgid "Defaults to 0."
+msgstr ""
+
+#: mmcls.datasets.transforms.auto_augment.BaseAugTransform:27
+#: mmcls.datasets.transforms.auto_augment.RandAugment:35 of
+msgid "Total level for the magnitude. Defaults to 10."
+msgstr ""
+
+#: mmcls.datasets.transforms.auto_augment.BaseAugTransform:30 of
+msgid "The probability for performing transformation therefore should be in range [0, 1]. Defaults to 0.5."
+msgstr ""
+
+#: mmcls.datasets.transforms.auto_augment.BaseAugTransform:33 of
+msgid "The probability that turns the magnitude negative, which should be in range [0,1]. Defaults to 0."
+msgstr ""
+
+#: mmcls.datasets.transforms.auto_augment.BaseAugTransform.extra_repr:1 of
+msgid "Extra repr string when auto-generating magnitude is enabled."
+msgstr ""
+
+#: ../../api/generated/mmcls.datasets.transforms.Brightness.rst:7
+msgid "Brightness"
+msgstr ""
+
+#: mmcls.datasets.transforms.auto_augment.Brightness:3 of
+msgid ""
+"The magnitude used for adjusting brightness. A positive magnitude would enhance the brightness and a "
+"negative magnitude would make the image darker. A magnitude=0 gives the origin img. If None, generate from "
+"``magnitude_range``, see :class:`BaseAugTransform`. Defaults to None."
+msgstr ""
+
+#: mmcls.datasets.transforms.auto_augment.Brightness:9 of
+msgid ""
+"The probability for performing brightness adjusting therefore should be in range [0, 1]. Defaults to 0.5."
+msgstr ""
+
+#: mmcls.datasets.transforms.auto_augment.Brightness:12
+#: mmcls.datasets.transforms.auto_augment.ColorTransform:12 mmcls.datasets.transforms.auto_augment.Contrast:12
+#: mmcls.datasets.transforms.auto_augment.Sharpness:12 mmcls.datasets.transforms.auto_augment.Shear:17
+#: mmcls.datasets.transforms.auto_augment.Translate:19 of
+msgid "The probability that turns the magnitude negative, which should be in range [0,1]. Defaults to 0.5."
+msgstr ""
+
+#: ../../api/generated/mmcls.datasets.transforms.Collect.rst:7
+msgid "Collect"
+msgstr ""
+
+#: mmcls.datasets.transforms.formatting.Collect:5 mmcls.datasets.transforms.formatting.Transpose:5
+#: mmcls.datasets.transforms.formatting.Transpose:9 of
+msgid "``*keys``"
+msgstr ""
+
+#: mmcls.datasets.transforms.formatting.Collect:7 mmcls.datasets.transforms.formatting.PackClsInputs:9 of
+msgid "**Deleted Keys:**"
+msgstr ""
+
+#: mmcls.datasets.transforms.formatting.Collect:9 of
+msgid "All keys except those in the argument ``*keys``."
+msgstr ""
+
+#: mmcls.datasets.transforms.formatting.Collect:11 of
+msgid "The keys of the fields to be collected."
+msgstr ""
+
+#: ../../api/generated/mmcls.datasets.transforms.ColorJitter.rst:7
+msgid "ColorJitter"
+msgstr ""
+
+#: mmcls.datasets.transforms.processing.ColorJitter:3 of
+msgid ""
+"Modified from https://github.com/pytorch/vision/blob/main/torchvision/transforms/transforms.py Licensed "
+"under the BSD 3-Clause License."
+msgstr ""
+
+#: mmcls.datasets.transforms.processing.ColorJitter:15 of
+msgid ""
+"How much to jitter brightness. brightness_factor is chosen uniformly from ``[max(0, 1 - brightness), 1 + "
+"brightness]`` or the given ``[min, max]``. Should be non negative numbers. Defaults to 0."
+msgstr ""
+
+#: mmcls.datasets.transforms.processing.ColorJitter:20 of
+msgid ""
+"How much to jitter contrast. contrast_factor is chosen uniformly from ``[max(0, 1 - contrast), 1 + "
+"contrast]`` or the given ``[min, max]``. Should be non negative numbers. Defaults to 0."
+msgstr ""
+
+#: mmcls.datasets.transforms.processing.ColorJitter:25 of
+msgid ""
+"How much to jitter saturation. saturation_factor is chosen uniformly from ``[max(0, 1 - saturation), 1 + "
+"saturation]`` or the given ``[min, max]``. Should be non negative numbers. Defaults to 0."
+msgstr ""
+
+#: mmcls.datasets.transforms.processing.ColorJitter:30 of
+msgid ""
+"How much to jitter hue. hue_factor is chosen uniformly from ``[-hue, hue]`` (0 <= hue <= 0.5) or the given "
+"``[min, max]`` (-0.5 <= min <= max <= 0.5). Defaults to 0."
+msgstr ""
+
+#: mmcls.datasets.transforms.processing.ColorJitter.transform:1
+#: mmcls.datasets.transforms.processing.Lighting.transform:1
+#: mmcls.datasets.transforms.processing.ResizeEdge.transform:1 of
+msgid "Transform function to resize images."
+msgstr ""
+
+#: mmcls.datasets.transforms.processing.ColorJitter.transform:6 of
+msgid "ColorJitter results, 'img' key is updated in result dict."
+msgstr ""
+
+#: ../../api/generated/mmcls.datasets.transforms.ColorTransform.rst:7
+msgid "ColorTransform"
+msgstr ""
+
+#: mmcls.datasets.transforms.auto_augment.ColorTransform:3 of
+msgid ""
+"The magnitude used for color transform. A positive magnitude would enhance the color and a negative "
+"magnitude would make the image grayer. A magnitude=0 gives the origin img. If None, generate from "
+"``magnitude_range``, see :class:`BaseAugTransform`. Defaults to None."
+msgstr ""
+
+#: mmcls.datasets.transforms.auto_augment.ColorTransform:9 of
+msgid "The probability for performing ColorTransform therefore should be in range [0, 1]. Defaults to 0.5."
+msgstr ""
+
+#: ../../api/generated/mmcls.datasets.transforms.Contrast.rst:7
+msgid "Contrast"
+msgstr ""
+
+#: mmcls.datasets.transforms.auto_augment.Contrast:3 of
+msgid ""
+"The magnitude used for adjusting contrast. A positive magnitude would enhance the contrast and a negative "
+"magnitude would make the image grayer. A magnitude=0 gives the origin img. If None, generate from "
+"``magnitude_range``, see :class:`BaseAugTransform`. Defaults to None."
+msgstr ""
+
+#: mmcls.datasets.transforms.auto_augment.Contrast:9 of
+msgid ""
+"The probability for performing contrast adjusting therefore should be in range [0, 1]. Defaults to 0.5."
+msgstr ""
+
+#: ../../api/generated/mmcls.datasets.transforms.Cutout.rst:7
+msgid "Cutout"
+msgstr ""
+
+#: mmcls.datasets.transforms.auto_augment.Cutout:3 of
+msgid ""
+"Expected cutout shape (h, w). If given as a single value, the value will be used for both h and w. If None, "
+"generate from ``magnitude_range``, see :class:`BaseAugTransform`. Defaults to None."
+msgstr ""
+
+#: mmcls.datasets.transforms.auto_augment.Cutout:8 of
+msgid ""
+"Pixel pad_val value for constant fill. If it is a sequence, it must have the same length with the image "
+"channels. Defaults to 128."
+msgstr ""
+
+#: mmcls.datasets.transforms.auto_augment.Cutout:12 of
+msgid "The probability for performing cutout therefore should be in range [0, 1]. Defaults to 0.5."
+msgstr ""
+
+#: ../../api/generated/mmcls.datasets.transforms.EfficientNetCenterCrop.rst:7
+msgid "EfficientNetCenterCrop"
+msgstr ""
+
+#: mmcls.datasets.transforms.processing.EfficientNetCenterCrop:12 of
+msgid "Expected size after cropping with the format of (h, w)."
+msgstr ""
+
+#: mmcls.datasets.transforms.processing.EfficientNetCenterCrop:15
+#: mmcls.datasets.transforms.processing.EfficientNetRandomCrop:18 of
+msgid "The crop padding parameter in efficientnet style center crop. Defaults to 32."
+msgstr ""
+
+#: mmcls.datasets.transforms.processing.EfficientNetCenterCrop:18 of
+msgid ""
+"Interpolation method, accepted values are 'nearest', 'bilinear', 'bicubic', 'area', 'lanczos'. Only valid "
+"if ``efficientnet_style`` is True. Defaults to 'bicubic'."
+msgstr ""
+
+#: mmcls.datasets.transforms.processing.EfficientNetCenterCrop:22 of
+msgid ""
+"The image resize backend type, accepted values are `cv2` and `pillow`. Only valid if efficientnet style is "
+"True. Defaults to `cv2`."
+msgstr ""
+
+#: mmcls.datasets.transforms.processing.EfficientNetCenterCrop:28
+#: mmcls.models.heads.multi_label_cls_head.MultiLabelClsHead:17
+#: mmcls.models.heads.multi_label_linear_head.MultiLabelLinearClsHead:17
+#: mmcls.models.losses.label_smooth_loss.LabelSmoothLoss:30 of
+msgid "提示"
+msgstr ""
+
+#: mmcls.datasets.transforms.processing.EfficientNetCenterCrop:29 of
+msgid "If the image is smaller than the crop size, return the original image."
+msgstr ""
+
+#: mmcls.datasets.transforms.processing.EfficientNetCenterCrop:31 of
+msgid "The pipeline will be to first to perform the center crop with the ``crop_size_`` as:"
+msgstr ""
+
+#: mmcls.datasets.transforms.processing.EfficientNetCenterCrop:34 of
+msgid ""
+"\\text{crop_size_} = \\frac{\\text{crop_size}}{\\text{crop_size} +\n"
+"\\text{crop_padding}} \\times \\text{short_edge}"
+msgstr ""
+
+#: mmcls.datasets.transforms.processing.EfficientNetCenterCrop:39 of
+msgid "And then the pipeline resizes the img to the input crop size."
+msgstr ""
+
+#: mmcls.datasets.transforms.processing.EfficientNetCenterCrop.transform:1
+#: mmcls.datasets.transforms.processing.RandomResizedCrop.transform:1 of
+msgid "Transform function to randomly resized crop images."
+msgstr ""
+
+#: mmcls.datasets.transforms.processing.EfficientNetCenterCrop.transform:6 of
+msgid ""
+"EfficientNet style center cropped results, 'img_shape' key in result dict is updated according to crop "
+"size."
+msgstr ""
+
+#: mmcls.datasets.transforms.processing.EfficientNetCenterCrop.transform:8 of
+msgid "EfficientNet style center cropped results, 'img_shape'"
+msgstr ""
+
+#: mmcls.datasets.transforms.processing.EfficientNetCenterCrop.transform:9
+#: mmcls.datasets.transforms.processing.RandomCrop.transform:9
+#: mmcls.datasets.transforms.processing.RandomResizedCrop.transform:9 of
+msgid "key in result dict is updated according to crop size."
+msgstr ""
+
+#: ../../api/generated/mmcls.datasets.transforms.EfficientNetRandomCrop.rst:7
+msgid "EfficientNetRandomCrop"
+msgstr ""
+
+#: mmcls.datasets.transforms.processing.EfficientNetRandomCrop:12 of
+msgid "Desired output scale of the crop. Only int size is accepted, a square crop (size, size) is made."
+msgstr ""
+
+#: mmcls.datasets.transforms.processing.EfficientNetRandomCrop:15 of
+msgid "Minimum ratio of the cropped area to the original area. Defaults to 0.1."
+msgstr ""
+
+#: mmcls.datasets.transforms.processing.EfficientNetRandomCrop:21
+#: mmcls.datasets.transforms.processing.RandomResizedCrop:20 of
+msgid "Range of the random size of the cropped image compared to the original image. Defaults to (0.08, 1.0)."
+msgstr ""
+
+#: mmcls.datasets.transforms.processing.EfficientNetRandomCrop:24
+#: mmcls.datasets.transforms.processing.RandomResizedCrop:23 of
+msgid ""
+"Range of the random aspect ratio of the cropped image compared to the original image. Defaults to (3. / 4., "
+"4. / 3.)."
+msgstr ""
+
+#: mmcls.datasets.transforms.processing.EfficientNetRandomCrop:28
+#: mmcls.datasets.transforms.processing.RandomResizedCrop:27 of
+msgid "Maximum number of attempts before falling back to Central Crop. Defaults to 10."
+msgstr ""
+
+#: mmcls.datasets.transforms.processing.EfficientNetRandomCrop:31 of
+msgid ""
+"Interpolation method, accepted values are 'nearest', 'bilinear', 'bicubic', 'area', 'lanczos'. Defaults to "
+"'bicubic'."
+msgstr ""
+
+#: mmcls.datasets.transforms.processing.EfficientNetRandomCrop:35
+#: mmcls.datasets.transforms.processing.RandomResizedCrop:34 of
+msgid "The image resize backend type, accepted values are 'cv2' and 'pillow'. Defaults to 'cv2'."
+msgstr ""
+
+#: ../../api/generated/mmcls.datasets.transforms.Equalize.rst:7
+msgid "Equalize"
+msgstr ""
+
+#: mmcls.datasets.transforms.auto_augment.Equalize:3 of
+msgid "The probability for performing equalize therefore should be in range [0, 1]. Defaults to 0.5."
+msgstr ""
+
+#: ../../api/generated/mmcls.datasets.transforms.Invert.rst:7
+msgid "Invert"
+msgstr ""
+
+#: mmcls.datasets.transforms.auto_augment.Invert:3 of
+msgid "The probability for performing invert therefore should be in range [0, 1]. Defaults to 0.5."
+msgstr ""
+
+#: ../../api/generated/mmcls.datasets.transforms.Lighting.rst:7
+msgid "Lighting"
+msgstr ""
+
+#: mmcls.datasets.transforms.processing.Lighting:11 of
+msgid "the eigenvalue of the convariance matrix of pixel values, respectively."
+msgstr ""
+
+#: mmcls.datasets.transforms.processing.Lighting:14 of
+msgid "the eigenvector of the convariance matrix of pixel values, respectively."
+msgstr ""
+
+#: mmcls.datasets.transforms.processing.Lighting:17 of
+msgid "The standard deviation for distribution of alpha. Defaults to 0.1."
+msgstr ""
+
+#: mmcls.datasets.transforms.processing.Lighting:20 of
+msgid "Whether to convert img to rgb. Defaults to False."
+msgstr ""
+
+#: mmcls.datasets.transforms.processing.Lighting.transform:6 of
+msgid "Lightinged results, 'img' key is updated in result dict."
+msgstr ""
+
+#: ../../api/generated/mmcls.datasets.transforms.PackClsInputs.rst:7
+msgid "PackClsInputs"
+msgstr ""
+
+#: mmcls.datasets.transforms.formatting.PackClsInputs:6 of
+msgid "gt_label (optional)"
+msgstr ""
+
+#: mmcls.datasets.transforms.formatting.PackClsInputs:7 of
+msgid "``*meta_keys`` (optional)"
+msgstr ""
+
+#: mmcls.datasets.transforms.formatting.PackClsInputs:11 of
+msgid "All keys in the dict."
+msgstr ""
+
+#: mmcls.datasets.transforms.formatting.PackClsInputs:13 mmcls.datasets.transforms.processing.ResizeEdge:12 of
+msgid "**Added Keys:**"
+msgstr ""
+
+#: mmcls.datasets.transforms.formatting.PackClsInputs:15 of
+msgid "inputs (:obj:`torch.Tensor`): The forward data of models."
+msgstr ""
+
+#: mmcls.datasets.transforms.formatting.PackClsInputs:16 of
+msgid "data_samples (:obj:`~mmcls.structures.ClsDataSample`): The annotation info of the sample."
+msgstr ""
+
+#: mmcls.datasets.transforms.formatting.PackClsInputs:19 of
+msgid ""
+"The meta keys to be saved in the ``metainfo`` of the packed ``data_samples``. Defaults to a tuple includes "
+"keys: - ``sample_idx``: The id of the image sample. - ``img_path``: The path to the image file. - "
+"``ori_shape``: The original shape of the image as a tuple (H, W). - ``img_shape``: The shape of the image "
+"after the pipeline as a tuple (H, W). - ``scale_factor``: The scale factor between the resized image "
+"and the original image. - ``flip``: A boolean indicating if image flip transform was used. - "
+"``flip_direction``: The flipping direction."
+msgstr ""
+
+#: mmcls.datasets.transforms.formatting.PackClsInputs:19 of
+msgid ""
+"The meta keys to be saved in the ``metainfo`` of the packed ``data_samples``. Defaults to a tuple includes "
+"keys:"
+msgstr ""
+
+#: mmcls.datasets.transforms.formatting.PackClsInputs:23 of
+msgid "``sample_idx``: The id of the image sample."
+msgstr ""
+
+#: mmcls.datasets.transforms.formatting.PackClsInputs:24 of
+msgid "``img_path``: The path to the image file."
+msgstr ""
+
+#: mmcls.datasets.transforms.formatting.PackClsInputs:25 of
+msgid "``ori_shape``: The original shape of the image as a tuple (H, W)."
+msgstr ""
+
+#: mmcls.datasets.transforms.formatting.PackClsInputs:26 of
+msgid "``img_shape``: The shape of the image after the pipeline as a tuple (H, W)."
+msgstr ""
+
+#: mmcls.datasets.transforms.formatting.PackClsInputs:28 of
+msgid "``scale_factor``: The scale factor between the resized image and the original image."
+msgstr ""
+
+#: mmcls.datasets.transforms.formatting.PackClsInputs:30 of
+msgid "``flip``: A boolean indicating if image flip transform was used."
+msgstr ""
+
+#: mmcls.datasets.transforms.formatting.PackClsInputs:31 of
+msgid "``flip_direction``: The flipping direction."
+msgstr ""
+
+#: mmcls.datasets.transforms.formatting.PackClsInputs.transform:1 of
+msgid "Method to pack the input data."
+msgstr ""
+
+#: ../../api/generated/mmcls.datasets.transforms.Posterize.rst:7
+msgid "Posterize"
+msgstr ""
+
+#: mmcls.datasets.transforms.auto_augment.Posterize:3 of
+msgid ""
+"Number of bits for each pixel in the output img, which should be less or equal to 8. If None, generate from "
+"``magnitude_range``, see :class:`BaseAugTransform`. Defaults to None."
+msgstr ""
+
+#: mmcls.datasets.transforms.auto_augment.Posterize:8 of
+msgid "The probability for posterizing therefore should be in range [0, 1]. Defaults to 0.5."
+msgstr ""
+
+#: ../../api/generated/mmcls.datasets.transforms.RandAugment.rst:7
+msgid "RandAugment"
+msgstr ""
+
+#: mmcls.datasets.transforms.auto_augment.RandAugment:3 of
+msgid ""
+"This data augmentation is proposed in `RandAugment: Practical automated data augmentation with a reduced "
+"search space `_."
+msgstr ""
+
+#: mmcls.datasets.transforms.auto_augment.RandAugment:7 of
+msgid ""
+"The policies of random augmentation. If string, use preset policies collection like \"timm_increasing\". If "
+"list, each item is one specific augmentation policy dict. The policy dict shall should have these keys: - "
+"``type`` (str), The type of augmentation. - ``magnitude_range`` (Sequence[number], optional): For those "
+"augmentation have magnitude, you need to specify the magnitude level mapping range. For example, assume "
+"``total_level`` is 10, ``magnitude_level=3`` specify magnitude is 3 if ``magnitude_range=(0, 10)`` "
+"while specify magnitude is 7 if ``magnitude_range=(10, 0)``. - other keyword arguments of the "
+"augmentation."
+msgstr ""
+
+#: mmcls.datasets.transforms.auto_augment.RandAugment:7 of
+msgid ""
+"The policies of random augmentation. If string, use preset policies collection like \"timm_increasing\". If "
+"list, each item is one specific augmentation policy dict. The policy dict shall should have these keys:"
+msgstr ""
+
+#: mmcls.datasets.transforms.auto_augment.RandAugment:12 of
+msgid "``type`` (str), The type of augmentation."
+msgstr ""
+
+#: mmcls.datasets.transforms.auto_augment.RandAugment:13 of
+msgid ""
+"``magnitude_range`` (Sequence[number], optional): For those augmentation have magnitude, you need to "
+"specify the magnitude level mapping range. For example, assume ``total_level`` is 10, ``magnitude_level=3`` "
+"specify magnitude is 3 if ``magnitude_range=(0, 10)`` while specify magnitude is 7 if "
+"``magnitude_range=(10, 0)``."
+msgstr ""
+
+#: mmcls.datasets.transforms.auto_augment.RandAugment:19 of
+msgid "other keyword arguments of the augmentation."
+msgstr ""
+
+#: mmcls.datasets.transforms.auto_augment.RandAugment:21 of
+msgid "Number of policies to select from policies each time."
+msgstr ""
+
+#: mmcls.datasets.transforms.auto_augment.RandAugment:24 of
+msgid "Magnitude level for all the augmentation selected."
+msgstr ""
+
+#: mmcls.datasets.transforms.auto_augment.RandAugment:27 of
+msgid ""
+"Deviation of magnitude noise applied. - If positive number, the magnitude obeys normal distribution :"
+"math:`\\mathcal{N}(magnitude_level, magnitude_std)`. - If 0 or negative number, magnitude remains "
+"unchanged. - If str \"inf\", the magnitude obeys uniform distribution :math:`Uniform(min, magnitude)`."
+msgstr ""
+
+#: mmcls.datasets.transforms.auto_augment.RandAugment:29 of
+msgid ""
+"If positive number, the magnitude obeys normal distribution :math:`\\mathcal{N}(magnitude_level, "
+"magnitude_std)`."
+msgstr ""
+
+#: mmcls.datasets.transforms.auto_augment.RandAugment:45 of
+msgid ""
+"To use \"timm-increasing\" policies collection, select two policies every time, and magnitude_level of "
+"every policy is 6 (total is 10 by default)"
+msgstr ""
+
+#: mmcls.datasets.transforms.auto_augment.RandAugment:60 of
+msgid ""
+"If you want the ``magnitude_level`` randomly changes every time, you can use ``magnitude_std`` to specify "
+"the random distribution. For example, a normal distribution :math:`\\mathcal{N}(6, 0.5)`."
+msgstr ""
+
+#: mmcls.datasets.transforms.auto_augment.RandAugment:71 of
+msgid "You can also use your own policies:"
+msgstr ""
+
+#: mmcls.datasets.transforms.auto_augment.RandAugment:86 of
+msgid ""
+"``magnitude_std`` will introduce some randomness to policy, modified by https://github.com/rwightman/"
+"pytorch-image-models."
+msgstr ""
+
+#: mmcls.datasets.transforms.auto_augment.RandAugment:89 of
+msgid "When magnitude_std=0, we calculate the magnitude as follows:"
+msgstr ""
+
+#: mmcls.datasets.transforms.auto_augment.RandAugment:91 of
+msgid ""
+"\\text{magnitude} = \\frac{\\text{magnitude_level}}\n"
+"{\\text{totallevel}} \\times (\\text{val2} - \\text{val1})\n"
+"+ \\text{val1}\n"
+"\n"
+msgstr ""
+
+#: mmcls.datasets.transforms.auto_augment.RandAugment.transform:1 of
+msgid "Randomly choose a sub-policy to apply."
+msgstr ""
+
+#: ../../api/generated/mmcls.datasets.transforms.RandomCrop.rst:7
+msgid "RandomCrop"
+msgstr ""
+
+#: mmcls.datasets.transforms.processing.RandomCrop:12 of
+msgid ""
+"Desired output size of the crop. If crop_size is an int instead of sequence like (h, w), a square crop "
+"(crop_size, crop_size) is made."
+msgstr ""
+
+#: mmcls.datasets.transforms.processing.RandomCrop:16 of
+msgid ""
+"Optional padding on each border of the image. If a sequence of length 4 is provided, it is used to pad "
+"left, top, right, bottom borders respectively. If a sequence of length 2 is provided, it is used to pad "
+"left/right, top/bottom borders, respectively. Default: None, which means no padding."
+msgstr ""
+
+#: mmcls.datasets.transforms.processing.RandomCrop:22 of
+msgid ""
+"It will pad the image if smaller than the desired size to avoid raising an exception. Since cropping is "
+"done after padding, the padding seems to be done at a random offset. Default: False."
+msgstr ""
+
+#: mmcls.datasets.transforms.processing.RandomCrop:27 of
+msgid ""
+"Pixel pad_val value for constant fill. If a tuple of length 3, it is used to pad_val R, G, B channels "
+"respectively. Default: 0."
+msgstr ""
+
+#: mmcls.datasets.transforms.processing.RandomCrop:31 of
+msgid ""
+"Type of padding. Defaults to \"constant\". Should be one of the following: - ``constant``: Pads with a "
+"constant value, this value is specified with pad_val. - ``edge``: pads with the last value at the edge of "
+"the image. - ``reflect``: Pads with reflection of image without repeating the last value on the edge. For "
+"example, padding [1, 2, 3, 4] with 2 elements on both sides in reflect mode will result in [3, 2, 1, 2, "
+"3, 4, 3, 2]. - ``symmetric``: Pads with reflection of image repeating the last value on the edge. For "
+"example, padding [1, 2, 3, 4] with 2 elements on both sides in symmetric mode will result in [2, 1, 1, "
+"2, 3, 4, 4, 3]."
+msgstr ""
+
+#: mmcls.datasets.transforms.processing.RandomCrop:31 of
+msgid "Type of padding. Defaults to \"constant\". Should be one of the following:"
+msgstr ""
+
+#: mmcls.datasets.transforms.processing.RandomCrop:34 of
+msgid "``constant``: Pads with a constant value, this value is specified with pad_val."
+msgstr ""
+
+#: mmcls.datasets.transforms.processing.RandomCrop:36 of
+msgid "``edge``: pads with the last value at the edge of the image."
+msgstr ""
+
+#: mmcls.datasets.transforms.processing.RandomCrop:37 of
+msgid ""
+"``reflect``: Pads with reflection of image without repeating the last value on the edge. For example, "
+"padding [1, 2, 3, 4] with 2 elements on both sides in reflect mode will result in [3, 2, 1, 2, 3, 4, 3, 2]."
+msgstr ""
+
+#: mmcls.datasets.transforms.processing.RandomCrop:41 of
+msgid ""
+"``symmetric``: Pads with reflection of image repeating the last value on the edge. For example, padding [1, "
+"2, 3, 4] with 2 elements on both sides in symmetric mode will result in [2, 1, 1, 2, 3, 4, 4, 3]."
+msgstr ""
+
+#: mmcls.datasets.transforms.processing.RandomCrop.transform:1 of
+msgid "Transform function to randomly crop images."
+msgstr ""
+
+#: mmcls.datasets.transforms.processing.RandomCrop.transform:6 of
+msgid "Randomly cropped results, 'img_shape' key in result dict is updated according to crop size."
+msgstr ""
+
+#: mmcls.datasets.transforms.processing.RandomCrop.transform:8 of
+msgid "Randomly cropped results, 'img_shape'"
+msgstr ""
+
+#: ../../api/generated/mmcls.datasets.transforms.RandomErasing.rst:7
+msgid "RandomErasing"
+msgstr ""
+
+#: mmcls.datasets.transforms.processing.RandomErasing:11 of
+msgid "Probability that image will be randomly erased. Default: 0.5"
+msgstr ""
+
+#: mmcls.datasets.transforms.processing.RandomErasing:14 of
+msgid "Minimum erased area / input image area Default: 0.02"
+msgstr ""
+
+#: mmcls.datasets.transforms.processing.RandomErasing:17 of
+msgid "Maximum erased area / input image area Default: 0.4"
+msgstr ""
+
+#: mmcls.datasets.transforms.processing.RandomErasing:20 of
+msgid ""
+"Aspect ratio range of erased area. if float, it will be converted to (aspect_ratio, 1/aspect_ratio) "
+"Default: (3/10, 10/3)"
+msgstr ""
+
+#: mmcls.datasets.transforms.processing.RandomErasing:24 of
+msgid ""
+"Fill method in erased area, can be: - const (default): All pixels are assign with the same value. - rand: "
+"each pixel is assigned with a random value in [0, 255]"
+msgstr ""
+
+#: mmcls.datasets.transforms.processing.RandomErasing:24 of
+msgid "Fill method in erased area, can be:"
+msgstr ""
+
+#: mmcls.datasets.transforms.processing.RandomErasing:26 of
+msgid "const (default): All pixels are assign with the same value."
+msgstr ""
+
+#: mmcls.datasets.transforms.processing.RandomErasing:27 of
+msgid "rand: each pixel is assigned with a random value in [0, 255]"
+msgstr ""
+
+#: mmcls.datasets.transforms.processing.RandomErasing:29 of
+msgid "Base color filled in erased area. Defaults to (128, 128, 128)."
+msgstr ""
+
+#: mmcls.datasets.transforms.processing.RandomErasing:32 of
+msgid ""
+"If set and ``mode`` is 'rand', fill erased area with random color from normal distribution "
+"(mean=fill_color, std=fill_std); If not set, fill erased area with random color from uniform distribution "
+"(0~255). Defaults to None."
+msgstr ""
+
+#: mmcls.datasets.transforms.processing.RandomErasing:40 of
+msgid "See `Random Erasing Data Augmentation `_"
+msgstr ""
+
+#: mmcls.datasets.transforms.processing.RandomErasing:43 of
+msgid ""
+"This paper provided 4 modes: RE-R, RE-M, RE-0, RE-255, and use RE-M as default. The config of these 4 modes "
+"are:"
+msgstr ""
+
+#: mmcls.datasets.transforms.processing.RandomErasing:46 of
+msgid "RE-R: RandomErasing(mode='rand')"
+msgstr ""
+
+#: mmcls.datasets.transforms.processing.RandomErasing:47 of
+msgid "RE-M: RandomErasing(mode='const', fill_color=(123.67, 116.3, 103.5))"
+msgstr ""
+
+#: mmcls.datasets.transforms.processing.RandomErasing:48 of
+msgid "RE-0: RandomErasing(mode='const', fill_color=0)"
+msgstr ""
+
+#: mmcls.datasets.transforms.processing.RandomErasing:49 of
+msgid "RE-255: RandomErasing(mode='const', fill_color=255)"
+msgstr ""
+
+#: mmcls.datasets.transforms.processing.RandomErasing.transform:1 of
+msgid "Results dict from pipeline"
+msgstr ""
+
+#: mmcls.datasets.transforms.processing.RandomErasing.transform:4 of
+msgid "Results after the transformation."
+msgstr ""
+
+#: ../../api/generated/mmcls.datasets.transforms.RandomResizedCrop.rst:7
+msgid "RandomResizedCrop"
+msgstr ""
+
+#: mmcls.datasets.transforms.processing.RandomResizedCrop:3 of
+msgid ""
+"A crop of random size (default: of 0.08 to 1.0) of the original size and a random aspect ratio (default: of "
+"3/4 to 4/3) of the original aspect ratio is made. This crop is finally resized to given size."
+msgstr ""
+
+#: mmcls.datasets.transforms.processing.RandomResizedCrop:16 of
+msgid ""
+"Desired output scale of the crop. If size is an int instead of sequence like (h, w), a square crop (size, "
+"size) is made."
+msgstr ""
+
+#: mmcls.datasets.transforms.processing.RandomResizedCrop:30 of
+msgid ""
+"Interpolation method, accepted values are 'nearest', 'bilinear', 'bicubic', 'area', 'lanczos'. Defaults to "
+"'bilinear'."
+msgstr ""
+
+#: mmcls.datasets.transforms.processing.RandomResizedCrop.transform:6 of
+msgid ""
+"Randomly resized cropped results, 'img_shape' key in result dict is updated according to crop size."
+msgstr ""
+
+#: mmcls.datasets.transforms.processing.RandomResizedCrop.transform:8 of
+msgid "Randomly resized cropped results, 'img_shape'"
+msgstr ""
+
+#: ../../api/generated/mmcls.datasets.transforms.ResizeEdge.rst:7
+msgid "ResizeEdge"
+msgstr ""
+
+#: mmcls.datasets.transforms.processing.ResizeEdge:14 of
+msgid "scale"
+msgstr ""
+
+#: mmcls.datasets.transforms.processing.ResizeEdge:15 of
+msgid "scale_factor"
+msgstr ""
+
+#: mmcls.datasets.transforms.processing.ResizeEdge:17 of
+msgid "The edge scale to resizing."
+msgstr ""
+
+#: mmcls.datasets.transforms.processing.ResizeEdge:19 of
+msgid "The edge to resize. Defaults to 'short'."
+msgstr ""
+
+#: mmcls.datasets.transforms.processing.ResizeEdge:21 of
+msgid ""
+"Image resize backend, choices are 'cv2' and 'pillow'. These two backends generates slightly different "
+"results. Defaults to 'cv2'."
+msgstr ""
+
+#: mmcls.datasets.transforms.processing.ResizeEdge:25 of
+msgid ""
+"Interpolation method, accepted values are \"nearest\", \"bilinear\", \"bicubic\", \"area\", \"lanczos\" for "
+"'cv2' backend, \"nearest\", \"bilinear\" for 'pillow' backend. Defaults to 'bilinear'."
+msgstr ""
+
+#: mmcls.datasets.transforms.processing.ResizeEdge.transform:6 of
+msgid "Resized results, 'img', 'scale', 'scale_factor', 'img_shape' keys are updated in result dict."
+msgstr ""
+
+#: ../../api/generated/mmcls.datasets.transforms.Rotate.rst:7
+msgid "Rotate"
+msgstr ""
+
+#: mmcls.datasets.transforms.auto_augment.Rotate:3 of
+msgid ""
+"The angle used for rotate. Positive values stand for clockwise rotation. If None, generate from "
+"``magnitude_range``, see :class:`BaseAugTransform`. Defaults to None."
+msgstr ""
+
+#: mmcls.datasets.transforms.auto_augment.Rotate:8 of
+msgid ""
+"Center point (w, h) of the rotation in the source image. If None, the center of the image will be used. "
+"Defaults to None."
+msgstr ""
+
+#: mmcls.datasets.transforms.auto_augment.Rotate:12 of
+msgid "Isotropic scale factor. Defaults to 1.0."
+msgstr ""
+
+#: mmcls.datasets.transforms.auto_augment.Rotate:14 mmcls.datasets.transforms.auto_augment.Shear:7
+#: mmcls.datasets.transforms.auto_augment.Translate:9 of
+msgid ""
+"Pixel pad_val value for constant fill. If a sequence of length 3, it is used to pad_val R, G, B channels "
+"respectively. Defaults to 128."
+msgstr ""
+
+#: mmcls.datasets.transforms.auto_augment.Rotate:18 of
+msgid "The probability for performing rotate therefore should be in range [0, 1]. Defaults to 0.5."
+msgstr ""
+
+#: mmcls.datasets.transforms.auto_augment.Rotate:21 of
+msgid "The probability that turns the angle negative, which should be in range [0,1]. Defaults to 0.5."
+msgstr ""
+
+#: mmcls.datasets.transforms.auto_augment.Rotate:24 mmcls.datasets.transforms.auto_augment.Translate:22 of
+msgid ""
+"Interpolation method. Options are 'nearest', 'bilinear', 'bicubic', 'area', 'lanczos'. Defaults to "
+"'nearest'."
+msgstr ""
+
+#: ../../api/generated/mmcls.datasets.transforms.Sharpness.rst:7
+msgid "Sharpness"
+msgstr ""
+
+#: mmcls.datasets.transforms.auto_augment.Sharpness:3 of
+msgid ""
+"The magnitude used for adjusting sharpness. A positive magnitude would enhance the sharpness and a negative "
+"magnitude would make the image bulr. A magnitude=0 gives the origin img. If None, generate from "
+"``magnitude_range``, see :class:`BaseAugTransform`. Defaults to None."
+msgstr ""
+
+#: mmcls.datasets.transforms.auto_augment.Sharpness:9 of
+msgid ""
+"The probability for performing sharpness adjusting therefore should be in range [0, 1]. Defaults to 0.5."
+msgstr ""
+
+#: ../../api/generated/mmcls.datasets.transforms.Shear.rst:7
+msgid "Shear"
+msgstr ""
+
+#: mmcls.datasets.transforms.auto_augment.Shear:3 of
+msgid ""
+"The magnitude used for shear. If None, generate from ``magnitude_range``, see :class:`BaseAugTransform`. "
+"Defaults to None."
+msgstr ""
+
+#: mmcls.datasets.transforms.auto_augment.Shear:11 of
+msgid "The probability for performing shear therefore should be in range [0, 1]. Defaults to 0.5."
+msgstr ""
+
+#: mmcls.datasets.transforms.auto_augment.Shear:14 of
+msgid "The shearing direction. Options are 'horizontal' and 'vertical'. Defaults to 'horizontal'."
+msgstr ""
+
+#: mmcls.datasets.transforms.auto_augment.Shear:20 of
+msgid ""
+"Interpolation method. Options are 'nearest', 'bilinear', 'bicubic', 'area', 'lanczos'. Defaults to "
+"'bicubic'."
+msgstr ""
+
+#: ../../api/generated/mmcls.datasets.transforms.Solarize.rst:7
+msgid "Solarize"
+msgstr ""
+
+#: mmcls.datasets.transforms.auto_augment.Solarize:3 of
+msgid ""
+"The threshold above which the pixels value will be inverted. If None, generate from ``magnitude_range``, "
+"see :class:`BaseAugTransform`. Defaults to None."
+msgstr ""
+
+#: mmcls.datasets.transforms.auto_augment.Solarize:7 mmcls.datasets.transforms.auto_augment.SolarizeAdd:10 of
+msgid "The probability for solarizing therefore should be in range [0, 1]. Defaults to 0.5."
+msgstr ""
+
+#: ../../api/generated/mmcls.datasets.transforms.SolarizeAdd.rst:7
+msgid "SolarizeAdd"
+msgstr ""
+
+#: mmcls.datasets.transforms.auto_augment.SolarizeAdd:3 of
+msgid ""
+"The value to be added to pixels below the thr. If None, generate from ``magnitude_range``, see :class:"
+"`BaseAugTransform`. Defaults to None."
+msgstr ""
+
+#: mmcls.datasets.transforms.auto_augment.SolarizeAdd:7 of
+msgid "The threshold below which the pixels value will be adjusted."
+msgstr ""
+
+#: ../../api/generated/mmcls.datasets.transforms.ToNumpy.rst:7
+msgid "ToNumpy"
+msgstr ""
+
+#: mmcls.datasets.transforms.formatting.ToNumpy:5 mmcls.datasets.transforms.formatting.ToNumpy:9 of
+msgid "``*keys**``"
+msgstr ""
+
+#: mmcls.datasets.transforms.formatting.ToNumpy:11 of
+msgid "The dtype of the converted numpy array. Defaults to None."
+msgstr ""
+
+#: mmcls.datasets.transforms.formatting.ToNumpy.transform:1 of
+msgid "Method to convert object to :obj:`numpy.ndarray`."
+msgstr ""
+
+#: ../../api/generated/mmcls.datasets.transforms.ToPIL.rst:7
+msgid "ToPIL"
+msgstr ""
+
+#: mmcls.datasets.transforms.formatting.ToPIL.transform:1 of
+msgid "Method to convert images to :obj:`PIL.Image.Image`."
+msgstr ""
+
+#: ../../api/generated/mmcls.datasets.transforms.Translate.rst:7
+msgid "Translate"
+msgstr ""
+
+#: mmcls.datasets.transforms.auto_augment.Translate:3 of
+msgid ""
+"The magnitude used for translate. Note that the offset is calculated by magnitude * size in the "
+"corresponding direction. With a magnitude of 1, the whole image will be moved out of the range. If None, "
+"generate from ``magnitude_range``, see :class:`BaseAugTransform`."
+msgstr ""
+
+#: mmcls.datasets.transforms.auto_augment.Translate:13 of
+msgid "The probability for performing translate therefore should be in range [0, 1]. Defaults to 0.5."
+msgstr ""
+
+#: mmcls.datasets.transforms.auto_augment.Translate:16 of
+msgid "The translating direction. Options are 'horizontal' and 'vertical'. Defaults to 'horizontal'."
+msgstr ""
+
+#: ../../api/generated/mmcls.datasets.transforms.Transpose.rst:7
+msgid "Transpose"
+msgstr ""
+
+#: mmcls.datasets.transforms.formatting.Transpose:11 of
+msgid "The fields to convert to tensor."
+msgstr ""
+
+#: mmcls.datasets.transforms.formatting.Transpose:13 of
+msgid "The output dimensions order."
+msgstr ""
+
+#: mmcls.datasets.transforms.formatting.Transpose.transform:1 of
+msgid "Method to transpose array."
+msgstr ""
+
+#: ../../api/generated/mmcls.engine.hooks.ClassNumCheckHook.rst:7
+msgid "ClassNumCheckHook"
+msgstr ""
+
+#: mmcls.engine.hooks.class_num_check_hook.ClassNumCheckHook.before_test:1 of
+msgid "Check whether the test dataset is compatible with head."
+msgstr ""
+
+#: mmcls.engine.hooks.class_num_check_hook.ClassNumCheckHook.before_test:3
+#: mmcls.engine.hooks.class_num_check_hook.ClassNumCheckHook.before_train:3
+#: mmcls.engine.hooks.class_num_check_hook.ClassNumCheckHook.before_val:3 of
+msgid "`IterBasedRunner`): Iter based Runner."
+msgstr ""
+
+#: mmcls.engine.hooks.class_num_check_hook.ClassNumCheckHook.before_train:1 of
+msgid "Check whether the training dataset is compatible with head."
+msgstr ""
+
+#: mmcls.engine.hooks.class_num_check_hook.ClassNumCheckHook.before_val:1 of
+msgid "Check whether the validation dataset is compatible with head."
+msgstr ""
+
+#: ../../api/generated/mmcls.engine.hooks.PreciseBNHook.rst:7
+msgid "PreciseBNHook"
+msgstr ""
+
+#: mmcls.engine.hooks.precise_bn_hook.PreciseBNHook:3 of
+msgid ""
+"Recompute and update the batch norm stats to make them more precise. During training both BN stats and the "
+"weight are changing after every iteration, so the running average can not precisely reflect the actual "
+"stats of the current model."
+msgstr ""
+
+#: mmcls.engine.hooks.precise_bn_hook.PreciseBNHook:8 of
+msgid ""
+"With this hook, the BN stats are recomputed with fixed weights, to make the running average more precise. "
+"Specifically, it computes the true average of per-batch mean/variance instead of the running average. See "
+"Sec. 3 of the paper `Rethinking Batch in BatchNorm ` for details."
+msgstr ""
+
+#: mmcls.engine.hooks.precise_bn_hook.PreciseBNHook:14 of
+msgid ""
+"This hook will update BN stats, so it should be executed before ``CheckpointHook`` and ``EMAHook``, "
+"generally set its priority to \"ABOVE_NORMAL\"."
+msgstr ""
+
+#: mmcls.engine.hooks.precise_bn_hook.PreciseBNHook:18 of
+msgid "The number of samples to update the bn stats. Defaults to 8192."
+msgstr ""
+
+#: mmcls.engine.hooks.precise_bn_hook.PreciseBNHook:21 of
+msgid "Perform precise bn interval. If the train loop is"
+msgstr ""
+
+#: mmcls.engine.hooks.precise_bn_hook.PreciseBNHook:23 mmcls.engine.hooks.precise_bn_hook.PreciseBNHook:25 of
+msgid "train loop is `IterBasedTrainLoop` or `by_epoch=False`, its unit is 'iter'. Defaults to 1."
+msgstr ""
+
+#: mmcls.engine.hooks.precise_bn_hook.PreciseBNHook.after_train_epoch:1
+#: mmcls.engine.hooks.precise_bn_hook.PreciseBNHook.after_train_iter:1 of
+msgid "Calculate prcise BN and broadcast BN stats across GPUs."
+msgstr ""
+
+#: mmcls.engine.hooks.precise_bn_hook.PreciseBNHook.after_train_epoch:3
+#: mmcls.engine.hooks.precise_bn_hook.PreciseBNHook.after_train_iter:3 of
+msgid "`Runner`): The runner of the training process."
+msgstr ""
+
+#: mmcls.engine.hooks.precise_bn_hook.PreciseBNHook.after_train_iter:4 of
+msgid "The index of the current batch in the train loop."
+msgstr ""
+
+#: mmcls.engine.hooks.precise_bn_hook.PreciseBNHook.after_train_iter:6 of
+msgid "Data from dataloader. Defaults to None."
+msgstr ""
+
+#: ../../api/generated/mmcls.engine.hooks.PrepareProtoBeforeValLoopHook.rst:7
+msgid "PrepareProtoBeforeValLoopHook"
+msgstr ""
+
+#: mmcls.engine.hooks.retriever_hooks.PrepareProtoBeforeValLoopHook:3 of
+msgid ""
+"Since the encoders of the retriever changes during training, the prototype changes accordingly. So the "
+"`prototype_vecs` needs to be regenerated before validation loop."
+msgstr ""
+
+#: ../../api/generated/mmcls.engine.hooks.SetAdaptiveMarginsHook.rst:7
+msgid "SetAdaptiveMarginsHook"
+msgstr ""
+
+#: mmcls.engine.hooks.margin_head_hooks.SetAdaptiveMarginsHook:4 of
+msgid ""
+"A PyTorch implementation of paper `Google Landmark Recognition 2020 Competition Third Place Solution "
+"`_. The margins will be :math:`\\text{f}(n) = (marginMax - marginMin) · "
+"norm(n^p) + marginMin`. The `n` indicates the number of occurrences of a category."
+msgstr ""
+
+#: mmcls.engine.hooks.margin_head_hooks.SetAdaptiveMarginsHook:10 of
+msgid "Lower bound of margins. Defaults to 0.05."
+msgstr ""
+
+#: mmcls.engine.hooks.margin_head_hooks.SetAdaptiveMarginsHook:12 of
+msgid "Upper bound of margins. Defaults to 0.5."
+msgstr ""
+
+#: mmcls.engine.hooks.margin_head_hooks.SetAdaptiveMarginsHook:14 of
+msgid "The power of category freqercy. Defaults to -0.25."
+msgstr ""
+
+#: mmcls.engine.hooks.margin_head_hooks.SetAdaptiveMarginsHook.before_train:1 of
+msgid "change the margins in ArcFaceClsHead."
+msgstr ""
+
+#: mmcls.engine.hooks.margin_head_hooks.SetAdaptiveMarginsHook.before_train:3 of
+msgid "`Runner`): Runner."
+msgstr ""
+
+#: ../../api/generated/mmcls.engine.hooks.VisualizationHook.rst:7
+msgid "VisualizationHook"
+msgstr ""
+
+#: mmcls.engine.hooks.visualization_hook.VisualizationHook:1 of
+msgid "Classification Visualization Hook. Used to visualize validation and testing prediction results."
+msgstr ""
+
+#: mmcls.engine.hooks.visualization_hook.VisualizationHook:4 of
+msgid "If ``out_dir`` is specified, all storage backends are ignored and save the image to the ``out_dir``."
+msgstr ""
+
+#: mmcls.engine.hooks.visualization_hook.VisualizationHook:6 of
+msgid ""
+"If ``show`` is True, plot the result image in a window, please confirm you are able to access the graphical "
+"interface."
+msgstr ""
+
+#: mmcls.engine.hooks.visualization_hook.VisualizationHook:9 of
+msgid "Whether to enable this hook. Defaults to False."
+msgstr ""
+
+#: mmcls.engine.hooks.visualization_hook.VisualizationHook:11 of
+msgid "The interval of samples to visualize. Defaults to 5000."
+msgstr ""
+
+#: mmcls.engine.hooks.visualization_hook.VisualizationHook:13 of
+msgid "Whether to display the drawn image. Defaults to False."
+msgstr ""
+
+#: mmcls.engine.hooks.visualization_hook.VisualizationHook:15 of
+msgid ""
+"directory where painted images will be saved in the testing process. If None, handle with the backends of "
+"the visualizer. Defaults to None."
+msgstr ""
+
+#: mmcls.engine.hooks.visualization_hook.VisualizationHook:19 of
+msgid "other keyword arguments of :meth:`mmcls.visualization.ClsVisualizer.add_datasample`."
+msgstr ""
+
+#: mmcls.engine.hooks.visualization_hook.VisualizationHook.after_test_iter:1 of
+msgid "Visualize every ``self.interval`` samples during test."
+msgstr ""
+
+#: mmcls.engine.hooks.visualization_hook.VisualizationHook.after_test_iter:3 of
+msgid "The runner of the testing process."
+msgstr ""
+
+#: mmcls.engine.hooks.visualization_hook.VisualizationHook.after_test_iter:5 of
+msgid "The index of the current batch in the test loop."
+msgstr ""
+
+#: mmcls.engine.hooks.visualization_hook.VisualizationHook.after_test_iter:7
+#: mmcls.engine.hooks.visualization_hook.VisualizationHook.after_val_iter:7 of
+msgid "Data from dataloader."
+msgstr ""
+
+#: mmcls.engine.hooks.visualization_hook.VisualizationHook.after_test_iter:9
+#: mmcls.engine.hooks.visualization_hook.VisualizationHook.after_val_iter:9 of
+msgid "Outputs from model."
+msgstr ""
+
+#: mmcls.engine.hooks.visualization_hook.VisualizationHook.after_val_iter:1 of
+msgid "Visualize every ``self.interval`` samples during validation."
+msgstr ""
+
+#: mmcls.engine.hooks.visualization_hook.VisualizationHook.after_val_iter:3 of
+msgid "The runner of the validation process."
+msgstr ""
+
+#: mmcls.engine.hooks.visualization_hook.VisualizationHook.after_val_iter:5 of
+msgid "The index of the current batch in the val loop."
+msgstr ""
+
+#: ../../api/generated/mmcls.engine.optimizers.Lamb.rst:7
+msgid "Lamb"
+msgstr ""
+
+#: mmcls.engine.optimizers.lamb.Lamb:3 of
+msgid ""
+"This class is copied from `timm`_. The LAMB was proposed in `Large Batch Optimization for Deep Learning - "
+"Training BERT in 76 minutes`_."
+msgstr ""
+
+#: mmcls.engine.optimizers.lamb.Lamb:11 of
+msgid "iterable of parameters to optimize or dicts defining"
+msgstr ""
+
+#: mmcls.engine.optimizers.lamb.Lamb:14 of
+msgid "learning rate. (default: 1e-3)"
+msgstr ""
+
+#: mmcls.engine.optimizers.lamb.Lamb:16 of
+msgid "coefficients used for computing running averages of gradient and its norm. (default: (0.9, 0.999))"
+msgstr ""
+
+#: mmcls.engine.optimizers.lamb.Lamb:19 of
+msgid "term added to the denominator to improve numerical stability. (default: 1e-8)"
+msgstr ""
+
+#: mmcls.engine.optimizers.lamb.Lamb:22 of
+msgid "weight decay (L2 penalty) (default: 0)"
+msgstr ""
+
+#: mmcls.engine.optimizers.lamb.Lamb:24 of
+msgid "whether apply (1-beta2) to grad when calculating running averages of gradient. (default: True)"
+msgstr ""
+
+#: mmcls.engine.optimizers.lamb.Lamb:27 of
+msgid "value used to clip global grad norm (default: 1.0)"
+msgstr ""
+
+#: mmcls.engine.optimizers.lamb.Lamb:30 of
+msgid "enable LAMBC trust ratio clipping (default: False)"
+msgstr ""
+
+#: mmcls.engine.optimizers.lamb.Lamb:32 of
+msgid "Apply adaptive learning rate to 0.0 weight decay parameter (default: False)"
+msgstr ""
+
+#: mmcls.engine.optimizers.lamb.Lamb.step:1 of
+msgid "Performs a single optimization step."
+msgstr ""
+
+#: mmcls.engine.optimizers.lamb.Lamb.step:3 of
+msgid "A closure that reevaluates the model and returns the loss."
+msgstr ""
+
+#: ../../api/generated/mmcls.evaluation.Accuracy.rst:7
+msgid "Accuracy"
+msgstr ""
+
+#: mmcls.evaluation.metrics.single_label.Accuracy:3 of
+msgid ""
+"For either binary classification or multi-class classification, the accuracy is the fraction of correct "
+"predictions in all predictions:"
+msgstr ""
+
+#: mmcls.evaluation.metrics.single_label.Accuracy:6 of
+msgid "\\text{Accuracy} = \\frac{N_{\\text{correct}}}{N_{\\text{all}}}"
+msgstr ""
+
+#: mmcls.evaluation.metrics.single_label.Accuracy:10 of
+msgid ""
+"If the ground truth label matches one of the best **k** predictions, the sample will be regard as a "
+"positive prediction. If the parameter is a tuple, all of top-k accuracy will be calculated and outputted "
+"together. Defaults to 1."
+msgstr ""
+
+#: mmcls.evaluation.metrics.single_label.Accuracy:15 of
+msgid ""
+"If a float, predictions with score lower than the threshold will be regard as the negative prediction. If "
+"None, not apply threshold. If the parameter is a tuple, accuracy based on all thresholds will be calculated "
+"and outputted together. Defaults to 0."
+msgstr ""
+
+#: mmcls.evaluation.metrics.multi_label.AveragePrecision:22
+#: mmcls.evaluation.metrics.multi_label.MultiLabelMetric:58 mmcls.evaluation.metrics.single_label.Accuracy:21
+#: mmcls.evaluation.metrics.single_label.SingleLabelMetric:57 of
+msgid ""
+"Device name used for collecting results from different ranks during distributed training. Must be 'cpu' or "
+"'gpu'. Defaults to 'cpu'."
+msgstr ""
+
+#: mmcls.evaluation.metrics.multi_label.AveragePrecision:26
+#: mmcls.evaluation.metrics.multi_label.MultiLabelMetric:62 mmcls.evaluation.metrics.single_label.Accuracy:25
+#: mmcls.evaluation.metrics.single_label.SingleLabelMetric:61 of
+msgid ""
+"The prefix that will be added in the metric names to disambiguate homonymous metrics of different "
+"evaluators. If prefix is not provided in the argument, self.default_prefix will be used instead. Defaults "
+"to None."
+msgstr ""
+
+#: mmcls.evaluation.metrics.single_label.Accuracy.calculate:1 of
+msgid "Calculate the accuracy."
+msgstr ""
+
+#: mmcls.evaluation.metrics.single_label.Accuracy.calculate:3
+#: mmcls.evaluation.metrics.single_label.SingleLabelMetric.calculate:3 of
+msgid "The prediction results. It can be labels (N, ), or scores of every class (N, C)."
+msgstr ""
+
+#: mmcls.evaluation.metrics.single_label.Accuracy.calculate:7
+#: mmcls.evaluation.metrics.single_label.SingleLabelMetric.calculate:7 of
+msgid "The target of each prediction with shape (N, )."
+msgstr ""
+
+#: mmcls.evaluation.metrics.single_label.Accuracy.calculate:10
+#: mmcls.evaluation.metrics.single_label.SingleLabelMetric.calculate:10 of
+msgid ""
+"Predictions with scores under the thresholds are considered negative. It's only used when ``pred`` is "
+"scores. None means no thresholds. Defaults to (0., )."
+msgstr ""
+
+#: mmcls.evaluation.metrics.single_label.Accuracy.calculate:15 of
+msgid ""
+"Predictions with scores under the thresholds are considered negative. It's only used when ``pred`` is "
+"scores. Defaults to (0., )."
+msgstr ""
+
+#: mmcls.evaluation.metrics.single_label.Accuracy.calculate:20 of
+msgid ""
+"Accuracy. - torch.Tensor: If the ``pred`` is a sequence of label instead of score (number of dimensions "
+"is 1). Only return a top-1 accuracy tensor, and ignore the argument ``topk` and ``thrs``. - "
+"List[List[torch.Tensor]]: If the ``pred`` is a sequence of score (number of dimensions is 2). Return the "
+"accuracy on each ``topk`` and ``thrs``. And the first dim is ``topk``, the second dim is ``thrs``."
+msgstr ""
+
+#: mmcls.evaluation.metrics.single_label.Accuracy.calculate:20 of
+msgid "Accuracy."
+msgstr ""
+
+#: mmcls.evaluation.metrics.single_label.Accuracy.calculate:22 of
+msgid ""
+"torch.Tensor: If the ``pred`` is a sequence of label instead of score (number of dimensions is 1). Only "
+"return a top-1 accuracy tensor, and ignore the argument ``topk` and ``thrs``."
+msgstr ""
+
+#: mmcls.evaluation.metrics.single_label.Accuracy.calculate:25 of
+msgid ""
+"List[List[torch.Tensor]]: If the ``pred`` is a sequence of score (number of dimensions is 2). Return the "
+"accuracy on each ``topk`` and ``thrs``. And the first dim is ``topk``, the second dim is ``thrs``."
+msgstr ""
+
+#: ../../api/generated/mmcls.evaluation.AveragePrecision.rst:25: